Preface


This book is drawn from the notes that I developed to teach AEB 6571 — 
Econometric Methods I in the Food and Resource Economics Department at 
the University of Florida from 2002 through 2010. The goal of this course was 
to cover the basics of statistical inference in support of a subsequent course 
on classical introductory econometrics. One of the challenges in teaching a
course like this is the nature of the previous courses that students have taken
in statistics and econometrics. Specifically, it is my experience that most
introductory courses take on a cookbook flavor: if you have this set of data and
want to analyze that concept, apply this technique. The difficulty in this course is to
motivate the why. The course is loosely based on two courses that I took at 
Purdue University (Econ 670 and Econ 671). 

While I was finishing this book, I discovered a book titled The Lady Tast- 
ing Tea: How Statistics Revolutionized Science in the Twentieth Century by 
David Salsburg. I would recommend any instructor assign this book as a com- 
panion text. It includes numerous pithy stories about the formal development 
of statistics that add to the numerical discussion in this textbook. One of the
important concepts introduced in The Lady Tasting Tea is the debate over 
the meaning of probability. The book also provides interesting insight into 
statisticians as real people. For example, William Sealy Gosset was a statisti- 
cian who developed the Student’s t distribution under the name Student while 
working in his day job with Guinness.

Another feature of the book is the introduction of symbolic programs Max- 
ima and Mathematica™ in Appendix A. These programs can be used to re- 
duce the cost of the mathematical and numerical complexity of some of the 
formulations in the textbook. In addition, I typically like to teach this course 
as a “numbers” course. Over the years I have used two programs in the
classroom: Gauss™ by Aptech and R, an open-source product of the R-Project.
In general, I prefer the numerical precision in Gauss. However, to use Gauss
efficiently you need several libraries (e.g., CO, for Constrained Optimization).
In addition, Gauss is proprietary. I can typically make the code available to 
students through Gauss-Lite based on my license. The alternative is R, which 
is open-source, but has a little less precision. The difficulties in precision are 
elevated in the solve() command for the inverse. In this textbook, I have given 
the R code for a couple of applications. 

Of course, writing a book is seldom a solitary enterprise. It behooves me 
to recognize several individuals who contributed to the textbook in a variety 
of ways. First, I would like to thank professors who taught my econometric 
courses over the years, including Paul Beaumont of Florida State University,
who taught Econ 670 and Econ 671 at Purdue University; James Binkley, 
who taught Ag Econ 650 at Purdue; and Wade Brorsen of Oklahoma State
University, who taught Ag Econ 651 at Purdue University during my time 
there. I don’t think that any of these professors would have pegged me to 
write this book. In addition, I would like to thank Scott Shonkwiler for our 
collaboration in my early years at the University of Florida. This collaboration 
included our work on the inverse hyperbolic sine transformation to normality. 
I also would like to thank the students who suffered through AEB 6571 — 
Econometric Methods I at the University of Florida. Several, including Cody
Dahl, Grigorios Livanis, Diwash Neupane, Matthew Salois, and Dong Hee 
Suh, have provided useful comments during the writing process. And in a 
strange way, I would like to thank Thomas Spreen, who assigned me to teach 
this course when he was the Food and Resource Economics Department’s 
graduate coordinator. I can honestly say that this is not a course that I would 
have volunteered to teach. However, I benefitted significantly from the effort. 
My econometric skills have become sharper because of the assignment. 

Finally, for the convenience of the readers and instructors, most of
my notes for AEB 6571 are available online at
http://ricardo.ifas.ufl.edu/aeb6571.econometrics/. The datasets and programs
used in this book are available at http://www.charlesbmoss.com:8080/MathStat/.
Defining Mathematical Statistics


CONTENTS 

1.1 Mathematical Statistics and Econometrics
    1.1.1 Econometrics and Scientific Discovery
    1.1.2 Econometrics and Planning
1.2 Mathematical Statistics and Modeling Economic Decisions
1.3 Chapter Summary
1.4 Review Questions


At the start of a course in mathematical statistics students usually ask three
questions. The first two are typically: What is this course going to be
about, and how is it different from the two or three other statistics courses
that most students have already taken before mathematical statistics? The
third question is: How does the study of mathematical statistics contribute to
my study of economics and econometrics? The simplest answer to the first
question is that we are going to develop statistical reasoning using mathe- 
matical techniques. It is my experience that most students approach statistics 
as a toolbox, memorizing many of the statistical estimators and tests (see 
box titled Mathematical Statistics — Savage). This course develops the 
student’s understanding of the reasons behind these tools. Ultimately, math- 
ematical statistics form the basis of econometric procedures used to analyze 
economic data and provide a firm basis for understanding decision making 
under risk and uncertainty. 

Mathematical statistics share the same linkage to statistics that mathe- 
matical economics has to economics. In mathematical economics, we develop 
the consequences of economic choice of such primal economic concepts as con- 
sumer demand and producer supply. Focusing on demand, we conceptualize 
how a concave set of ordinal preferences implies that consumers will choose a 
unique bundle of goods given any set of prices and level of income. By exten- 
sion, we can follow the logic to infer that these conditions will lead to demand 
curves that are downward sloping and quasi-convex in price space by Roy’s 
identity. Thus, any violation of these conditions (i.e., downward sloping and 
quasi-convexity) implies that the demand curve is not consistent with a unique 
point on an ordinal utility map. Hence, the development of logical connections 
using mathematical precision gives us a logical structure for our analysis. 
Mathematical Statistics — Savage


In the present century there has been and continues to be extraordi- 
nary interest in mathematical treatment of problems of inductive 
inference. For reasons I cannot and need not analyze here, this 
activity has been strikingly concentrated in the English-speaking 
world. It is known under several names, most of which stress some 
aspects of the subject that seemed of overwhelming importance at 
the moment when the name was coined. “Mathematical statistics,” 
one of its earliest names, is still the most popular. In this name, 
“mathematical” seems to be intended to connote rational, theoret- 
ical, or perhaps mathematically advanced, to distinguish the sub- 
ject from those problems of gathering and condensing numerical 
data that can be considered apart from the problem of inductive 
inference, the mathematical treatment of which is generally triv- 
ial. The name “statistical inference” recognizes that the subject is 
concerned with inductive inference. The name “statistical decision” 
reflects the idea that inductive inference is not always, if ever, con- 
cerned with what to believe in the face of inconclusive evidence, 
but that at least sometimes it is concerned with what action to 
decide upon under such circumstances [41, p. 2]. 


The linkage between mathematical statistics and statistics is similar. The 
theory of statistical inference is based on primal concepts such as estima- 
tion, sample design, and hypothesis testing. Mathematical statistics allow for 
the rigorous development of this statistical reasoning. Conceptually, we will 
define what is meant by a random variable, how the characteristics of this 
random variable are linked with a distribution, and how knowledge of these 
distributions can be used to design estimators and hypothesis tests that are 
meaningful (see box titled the Role of Foundations — Savage). 


The Role of Foundations — Savage 


It is often argued academically that no science can be more se- 
cure than its foundations, and that, if there is controversy about 
the foundations, there must be even greater controversy about the 
higher parts of the science. As a matter of fact, the foundations are 
the most controversial parts of many, if not all sciences. Physics 
and pure mathematics are excellent examples of this phenomenon. 
As for statistics, the foundations include, on any interpretation of 
which I have ever heard, the foundations of probability, as contro- 
versial a subject as one could name. As in other sciences, contro- 
versies over the foundations of statistics reflect themselves to some 
extent in everyday practice, but not nearly so catastrophically as
one might imagine. I believe that here as elsewhere, catastrophe 
can be avoided, primarily because in practical situations common 
sense generally saves all but the most pedantic of us from flagrant 
error [41, p. 1]. 


However, in our development of these mathematical meanings in statistics, 
we will be forced to consider the uses of these procedures in economics. For 
much of the twentieth century, economics attempted to define itself as the 
science that studies the allocation of limited resources to meet unlimited and 
competing human wants and desires [4]. However, the definition of economics 
as a science may raise objections inside and outside the discipline. Key to the 
definition of a field of study as a science is the ability or willingness of its 
students and practitioners to allow its tenets to be empirically validated. In 
essence, it must be possible to reject a cherished hypothesis based on empirical 
observation. It is not obvious that economists have been willing to follow 
through with this threat. For example, remember our cherished notion that 
demand curves must be downward sloping and quasi-convex in price space. 
Many practitioners have estimated results where these basic relationships are 
violated. However, we do not reject our so-called “Law of Demand.” Instead we 
expend significant efforts to explain why the formulation yielding this result 
is inadequate. In fact, there are several possible reasons to suspect that the 
empirical or econometric results are indeed inadequate, many of which we 
develop in this book. My point is that despite the desire of economists to 
be classified as scientists, economists are frequently reluctant to put theory
to an empirical test in the same way as a biologist or physicist. Because of
this failure, economics largely deserves the suspicion of these white-coated
practitioners of more basic sciences. 


1.1 Mathematical Statistics and Econometrics 


The study of mathematical statistics by economists typically falls under a 
broad sub-discipline called econometrics. Econometrics is typically defined as 
the use of statistics and mathematics along with economic theory to describe 
economic relationships (see the boxes titled Tinbergen on Econometrics 
and Klein on Econometrics). The real issue is what do we mean by de- 
scribe? There are two dominant ideas in econometrics. The first involves the 
scientific concept of using statistical techniques (or more precisely, statistical 
inference) to test implications of economic theory. Hence, in a traditional sci- 
entific paradigm, we expose what we think we know to experience (see the box 
titled Popper on Scientific Discovery). The second use of econometrics involves
the estimation of parameters to be used in policy analysis. For example,
economists working with a state legislature may be interested in estimating 
the effect of a sales tax holiday for school supplies on the government’s sales 
tax revenue. As a result, they may be more interested in imposing economi- 
cally justified restrictions that add additional information to their data rather 
than testing these hypotheses. The two uses of econometrics could then be 
summarized as scientific uses versus the uses of planners. 


Tinbergen on Econometrics 


Econometrics is the name for a field of science in which 
mathematical-economic and mathematical-statistical research are 
applied in combination. Econometrics, therefore, forms a border- 
land between two branches of science, with the advantages and 
disadvantages thereof; advantages, because new combinations are 
introduced which often open up new perspectives; disadvantages, 
because the work in this field requires skill in two domains, which 
either takes up too much time or leads to insufficient training of 
its students in one of the two respects [51, p. 3]. 


Klein on Econometrics 


The purely theoretical approach to econometrics may be envisioned 
as the development of that body of knowledge which tells us how 
to go about measuring economic relationships. This theory is often 
developed on a fairly abstract or general basis, so that the results 
may be applied to any one of a variety of concrete problems that 
may arise. The empirical work in econometrics deals with actual 
data and sets out to make numerical estimates of economic rela- 
tionships. The empirical procedures are direct applications of the 
methods of theoretical econometrics [24, p. 1]. 


Popper on Scientific Discovery 


A scientist, whether theorist or experimenter, puts forward state- 
ments, or systems of statements, and tests them step by step. In the 
field of the empirical sciences, more particularly, he constructs hy- 
potheses, or systems of theories, and tests them against experience 
by observation and experiment. 

I suggest that it is the task of the logic of scientific discovery, or 
logic of knowledge, to give a logical analysis of this procedure; that 
is to analyse the method of empirical sciences [38, p. 3]. 
1.1.1 Econometrics and Scientific Discovery


The most prominent supporters of the traditional scientific paradigm in
econometrics are Theil, Kmenta, and Spanos. According to Theil,


Econometrics is concerned with the empirical determination of eco- 
nomic laws. The word “empirical” indicates that the data used for 
this determination have been obtained from observation, which may 
be either controlled experimentation designed by the econometrician 
interested, or “passive” observation. The latter type is as prevalent 
among economists as it is among meteorologists [49, p. 1].


Kmenta [26] divides statistical applications in economics into descriptive
statistics and statistical inference. Kmenta contends that most statistical ap- 
plications in economics involve applications of statistical inference, that is, the 
use of statistical data to draw conclusions or test hypotheses about economic 
behavior. Spanos states that “econometrics is concerned with the systematic 
study of economic phenomena using observed data” [45, p. 3]. 


How it all began — Haavelmo 


The status of general economics was more or less as follows. There 
were lots of deep thoughts, but a lack of quantitative results. Even 
in simple cases where it can be said that some economic magnitude 
is influenced by only one causal factor, the question of how strong 
is the influence still remains. It is usually not of very great practical 
or even scientific interest to know whether the influence is positive 
or negative, if one does not know anything about the strength. But 
much worse is the situation when an economic magnitude to be 
studied is determined by many different factors at the same time, 
some factors working in one direction, others in the opposite di- 
rections. One could write long papers about so-called tendencies 
explaining how this factor might work, how that factor might work 
and so on. But what is the answer to the question of the total net 
effect of all the factors? This question cannot be answered without 
measures of the strength with which the various factors work in 
their directions. The fathers of modern econometrics, led by the 
giant brains of Ragnar Frisch and Jan Tinbergen, had the vision 
that it would be possible to get out of this situation for the science 
of economics. Their program was to use available statistical mate- 
rial in order to extract information about how an economy works. 
Only in this way could one get beyond the state of affairs where 
talk of tendencies was about all one could have as a result from 
even the greatest brains in the science of economics [15]. 
Nature of Econometrics — Judge et al.


If the goal is to select the best decision from the economic choice 
set, it is usually not enough just to know that certain economic 
variables are related. To be really useful we must usually also know 
the direction of the relation and in many cases the magnitudes 
involved. Toward this end, econometrics, using economic theory, 
mathematical economics, and statistical inference as an analytical 
foundation and economic data as the information base, provides an 
inferential basis for: 

(1) Modifying, refining, or possibly refuting conclusions contained 
in economic theory and/or what represents current knowledge 
about economic processes and institutions. 

(2) Attaching signs, numbers, and reliability statements to the
coefficients of variables in economic relationships so that this information
can be used as a basis for decision making and choice [23, p. 1].


A quick survey of a couple of important economics journals provides a look 
at how econometrics is used in the development of economic theory. Ashraf and 
Galor [2] examine the effect of genetic diversity on economic growth. Specifi- 
cally, they hypothesize that increased genetic diversity initially increases eco- 
nomic growth as individuals from diverse cultures allow the economy to quickly 
adopt a wide array of technological innovations. However, this rate of increase 
starts to decline such that the effect of diversity reaches a maximum as the 
increased diversity starts to impose higher transaction costs on the economy. 
Thus, Ashraf and Galor hypothesize that the effect of diversity on population 
growth is “hump shaped.” To test this hypothesis, they estimate two empirical 
relationships. The first relationship examines the effect of genetic diversity on 
each country’s population density. 


$$\ln(P_i) = \beta_0 + \beta_1 G_i + \beta_2 G_i^2 + \beta_3 \ln(T_i) + \beta_4 \ln(X_{1i}) + \beta_5 \ln(X_{2i}) + \beta_6 \ln(X_{3i}) + \epsilon_i \tag{1.1}$$

where $\ln(P_i)$ is the natural logarithm of the population density for country
$i$, $G_i$ is a measure of genetic diversity in country $i$, $T_i$ is the time in years
since the establishment of agriculture in country $i$, $X_{1i}$ is the percentage of
arable land in country $i$, $X_{2i}$ is the absolute latitude of country $i$, $X_{3i}$ is a
variable capturing the suitability of land in country $i$ for agriculture, and $\epsilon_i$ is
the residual. The second equation then estimates the effect of the same factors
on each country’s income per capita.


$$\ln(y_i) = \gamma_0 + \gamma_1 \hat{G}_i + \gamma_2 \hat{G}_i^2 + \gamma_3 \ln(T_i) + \gamma_4 \ln(X_{1i}) + \gamma_5 \ln(X_{2i}) + \gamma_6 \ln(X_{3i}) + \nu_i \tag{1.2}$$

where $y_i$ represents the income per capita and $\hat{G}_i$ is the estimated level of
genetic diversity. Ashraf and Galor use the estimated genetic diversity to adjust
for the relationship between genetic diversity and the path of development 
TABLE 1.1
Estimated Effect of Genetic Diversity on Economic Development

                                            Population      Income
Variable                                    Density         per Capita
Genetic Diversity ($G_i$)                   225.440***      203.443**
                                            (73.781)^a      (83.368)
Genetic Diversity Squared ($G_i^2$)         -161.158**      -142.663**
                                            (56.155)        (59.037)
Emergence of Agriculture ($\ln(T_i)$)       -1.214***       -0.151
                                            (0.373)         (0.197)
Percent of Arable Land ($\ln(X_{1i})$)      0.516***        -0.112
                                            (0.165)         (0.103)
Absolute Latitude ($\ln(X_{2i})$)           -0.162          0.163
                                            (0.130)         (0.117)
Land Suitability ($\ln(X_{3i})$)            0.571*          -0.192**
                                            (0.294)         (0.096)
$R^2$                                       0.89            0.57

^a Numbers in parentheses denote standard errors. *** denotes statistical
significance at the 0.01 level, ** denotes statistical significance at the
0.05 level, and * denotes statistical significance at the 0.10 level.

Source: Ashraf and Galor [2]


from Africa to other regions of the world (i.e., the “Out of Africa” hypothe- 
sis). The statistical results of these estimations presented in Table 1.1 support 
the theoretical arguments of Ashraf and Galor. 
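
The hump shape can be read directly from the two diversity coefficients. As a minimal sketch (the function name `turning_point` is illustrative and not taken from the study), a fitted quadratic b1*G + b2*G^2 with b2 < 0 reaches its maximum where the derivative b1 + 2*b2*G equals zero, that is, at G* = -b1/(2*b2). The example uses the income-per-capita coefficients reported in Table 1.1.

```python
def turning_point(b1: float, b2: float) -> float:
    """Return the G at which b1*G + b2*G**2 reaches its maximum.

    Requires b2 < 0, i.e., a concave (hump-shaped) quadratic.
    """
    if b2 >= 0:
        raise ValueError("b2 must be negative for a hump-shaped relationship")
    return -b1 / (2.0 * b2)


# Income-per-capita column of Table 1.1: 203.443 on G and -142.663 on G squared.
g_star_income = turning_point(203.443, -142.663)
print(round(g_star_income, 3))  # prints 0.713
```

Under this reading, the estimated effect of diversity on income per capita rises up to a diversity level of roughly 0.71 and declines beyond it.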

In the same journal, Naidu and Yuchtman [35] examine whether the “Mas- 
ter and Servant Act” used to enforce labor contracts in Britain in the nine- 
teenth century affected wages. At the beginning of the twenty-first century a 
variety of labor contracts exist in the United States. Most hourly employees 
have an implicit or continuing consent contract which is not formally bind- 
ing on either the employer or the employee. By contrast, university faculty 
typically sign annual employment contracts for the upcoming academic year. 
Technically, this contract binds the employer to continue to pay the faculty 
member the contracted amount throughout the academic year unless the fac- 
ulty member violates the terms of this contract. However, while the faculty 
member is bound by the contract, sufficient latitude is typically provided for 
the employee to be released from the contract before the end of the academic 
year without penalty (or by forfeiting the remaining payments under the con- 
tract). Naidu and Yuchtman note that labor laws in Britain (the Master and 
Servant Act of 1823) increased the enforcement of these labor contracts by 
providing both civil and criminal penalties for employee breach of contract. 
Under this act employees who attempted to leave a job for a better opportu- 
nity could be forced back into the original job under the terms of the contract. 
TABLE 1.2
Estimates of the Effect of Master and Servant Prosecutions on Wages

Variable                                    Parameter
Fraction of Textiles × ln(Cotton Price)     159.3***
                                            (42.02)^a
Iron County × ln(Iron Price)                51.98**
                                            (19.48)
Coal County × ln(Coal Price)                41.25***
                                            (10.11)
ln(Population)                              79.13**
                                            (35.09)

^a Numbers in parentheses denote standard errors. *** denotes statistical
significance at the 0.01 level, and ** denotes statistical significance at
the 0.05 level.

Source: Naidu and Yuchtman [35]


Naidu and Yuchtman develop an economic model which indicates that the en- 
forcement of this law will reduce the average wage rate. Hence, they start their 
analysis by examining factors that determine the number of prosecutions un- 
der the Master and Servant laws for counties in Britain before 1875. 


$$Z_{it} = \alpha_0 + \alpha_1 S_i \times \ln(X_{1,t}) + \alpha_2 I_{2,i} \times \ln(X_{2,t}) + \alpha_3 I_{3,i} \times \ln(X_{3,t}) + \alpha_4 \ln(p_{i,t}) + \epsilon_{it} \tag{1.3}$$

where $Z_{it}$ is the number of prosecutions under the Master and Servant Act in
county $i$ in year $t$, $S_i$ is the share of textile production in county $i$ in 1851,
$X_{1,t}$ is the cotton price at time $t$, $I_{2,i}$ is a dummy variable that is 1 if the
county produces iron and 0 otherwise, $X_{2,t}$ is the iron price at time $t$, $I_{3,i}$ is
a dummy variable that is 1 if the county produces coal and 0 otherwise, $X_{3,t}$
is the price of coal, $p_{i,t}$ is the population of county $i$ at time $t$, and $\epsilon_{it}$ is the
residual. The results for this formulation are presented in Table 1.2. Next,
Naidu and Yuchtman estimate the effect of these prosecutions on the wage
rate.


$$w_{it} = \beta_0 + \beta_1 I_t \times \ln(Z_i) + \beta_2 X_{5,it} + \beta_3 X_{6,it} + \beta_4 \ln(X_{7,it}) + \beta_5 \ln(p_{i,t}) + \beta_6 X_{8,it} + \nu_{it} \tag{1.4}$$

where $w_{it}$ is the average wage rate in county $i$ at time $t$, $I_t$ is a dummy variable
that is 1 if $t > 1875$ (or after the repeal of the Master and Servant Act) and
0 otherwise, $X_{5,it}$ is the population density of county $i$ at time $t$, $X_{6,it}$ is the
proportion of the population living in urban areas in county $i$ at time $t$, $X_{7,it}$ is
the average income in county $i$ at time $t$, $X_{8,it}$ is the level of union membership
in county $i$ at time $t$, and $\nu_{it}$ is the residual. The results presented in Table 1.3
provide weak support (i.e., at the 0.10 level of significance) that prosecutions
under the Master and Servant Act reduced wages. Specifically, the positive
TABLE 1.3
Effect of Master and Servant Prosecutions on the Wage Rate

Variable                                    Parameter
Post-1875 × ln(Average Prosecutions)        0.0122*
                                            (0.0061)^a
Population Density                          -0.0570
                                            (0.0583)
Proportion Urban                            -0.0488
                                            (0.0461)
ln(Income)                                  0.0291
                                            (0.0312)
ln(Population)                              0.0944**
                                            (0.0389)
Union Membership                            0.0881
                                            (0.0955)

^a Numbers in parentheses denote standard errors. ** denotes statistical
significance at the 0.05 level and * denotes statistical significance at
the 0.10 level.

Source: Naidu and Yuchtman [35]


coefficient on the post-1875 variable indicates that wages were 0.0122 shillings 
per hour higher after the Master and Servant Act was repealed in 1875. 
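
The significance markers in Table 1.3 can be related to the ratio of each estimate to its standard error. The sketch below computes these ratios and flags those that exceed 1.645 in absolute value, the two-sided 0.10 critical value of the standard normal distribution. Using the normal approximation is an assumption made here for illustration; the thresholds in the study itself depend on its degrees of freedom and on how the standard errors were constructed, neither of which is reported in this chapter.

```python
# Coefficients and standard errors as reported in Table 1.3.
table_1_3 = {
    "Post-1875 x ln(Average Prosecutions)": (0.0122, 0.0061),
    "Population Density": (-0.0570, 0.0583),
    "Proportion Urban": (-0.0488, 0.0461),
    "ln(Income)": (0.0291, 0.0312),
    "ln(Population)": (0.0944, 0.0389),
    "Union Membership": (0.0881, 0.0955),
}

# Two-sided 0.10 critical value of the standard normal (assumed approximation).
Z_CRIT_10 = 1.645


def t_ratio(estimate: float, std_error: float) -> float:
    """Ratio of an estimate to its standard error."""
    return estimate / std_error


ratios = {name: t_ratio(b, se) for name, (b, se) in table_1_3.items()}
flagged = [name for name, t in ratios.items() if abs(t) > Z_CRIT_10]

for name, t in ratios.items():
    marker = "flagged" if name in flagged else ""
    print(f"{name:40s} {t:7.3f} {marker}")
```

Under this approximation only the prosecution interaction and the population term are flagged, which agrees with the two starred rows of the table.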

As a final example, consider the research of Kling et al. [25], who examine 
the role of information in the purchase of Medicare drug plans. In the Medicare 
Part D prescription drug insurance program consumers choose from a menu of 
drug plans. These different plans offer a variety of terms, including the price 
of the coverage, the level of deductibility (i.e., the lower limit required for the
insurance to start paying benefits), and the amount of co-payment (e.g., the 
share of the price of the drug that must be paid by the senior). Ultimately con- 
sumers make a variety of choices. These differences may be driven in part by 
differences between household circumstances. For example, some seniors may 
be in better health than others. Alternatively, some households may be in 
better financial condition. Finally, the households probably have different at- 
titudes toward risk. Under typical assumptions regarding consumer behavior, 
the ability to choose maximizes the benefits from Medicare Part D to seniors. 
However, the conjecture that consumer choice maximizes the benefit from the 
Medicare drug plans depends on the consumer’s ability to understand the 
benefits provided by each plan. This concept is particularly important given 
the complexity of most insurance packages. Kling et al. analyze the possibility
of comparison friction. Comparison friction is a bias against switching to a
possibly better product because the two products are difficult to compare. To
analyze the significance of comparison friction Kling et al. construct a sample 
of seniors who purchase Medicare Part D coverage. Splitting this sample into 
a control group and an intervention (or treatment) group, the intervention 
group was then provided personalized information about how each alternative 
would affect the household. The control group was then given access to a
webpage which could be used to construct the same information. The researchers
then observed which households switched their coverage. The sample was then 
used to estimate 


$$\begin{aligned} D_i = {} & \alpha_0 + \alpha_1 Z_i + \alpha_2 X_{1i} + \alpha_3 X_{2i} + \alpha_4 X_{3i} + \alpha_5 X_{4i} + \alpha_6 X_{5i} \\ & + \alpha_7 X_{6i} + \alpha_8 X_{7i} + \alpha_9 X_{8i} + \alpha_{10} X_{9i} + \alpha_{11} X_{10i} + \epsilon_i \end{aligned} \tag{1.5}$$

where $D_i$ is one if the household switches its plan and zero otherwise, $Z_i$ is
the intervention variable equal to one if the household was provided individual
information, $X_{1i}$ is a dummy variable which is one if the head of household is
female, $X_{2i}$ is one if the head of household is married, $X_{3i}$ is one if the
individual finished high school, $X_{4i}$ is one if the participant finished college, $X_{5i}$ is
one if the individual completed post-graduate studies, $X_{6i}$ is one if the
participant is over 70 years old, $X_{7i}$ is one if the participant is over 75 years old, $X_{8i}$
is one if the individual has over four medications, $X_{9i}$ is one if the participant
has over seven medications, and $X_{10i}$ is one if the household is poor.

Table 1.4 presents the empirical results of this model. In general these 
results confirm a comparison friction since seniors who are given more in- 
formation about alternatives are more likely to switch (i.e., the estimated 
intervention parameter is statistically significant at the 0.10 level). However, 
the empirical results indicate that other factors matter. For example, mar- 
ried couples are more likely to switch. In addition, individuals who take over 
seven medications are more likely to switch. Interestingly, individual levels of 
education (i.e., the high school graduate, college graduate, and post-college 
graduate variables) are not individually significant. However, further testing 
would be required to determine whether education was statistically informa- 
tive. Specifically, we would have to design a statistical test that simultaneously 
restricted all three parameters to be zero at the same time. As constructed, we 
can only compare each individual effect with the dropped category (probably 
that the participant did not complete high school). 
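
A joint test of this kind compares the fit of the model with and without the three education variables. The following is a minimal sketch on simulated data, not the data or the procedure of Kling et al.: the variable names, the sample size, and the data-generating process are all hypothetical, and the classical F statistic assumes homoskedastic errors, which a regression with a zero-one dependent variable generally violates. It is meant only to show the mechanics of restricting several parameters to zero at once.

```python
import numpy as np


def ols_ssr(y: np.ndarray, X: np.ndarray) -> float:
    """Sum of squared residuals from a least-squares regression of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def joint_f_stat(y: np.ndarray, X_unrestricted: np.ndarray, X_restricted: np.ndarray) -> float:
    """F statistic for the exclusion restrictions separating the two designs."""
    n, k = X_unrestricted.shape
    q = k - X_restricted.shape[1]          # number of restrictions
    ssr_u = ols_ssr(y, X_unrestricted)
    ssr_r = ols_ssr(y, X_restricted)
    return ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))


rng = np.random.default_rng(0)
n = 500                                     # hypothetical sample size
z = rng.integers(0, 2, n).astype(float)     # hypothetical intervention dummy
level = rng.integers(0, 4, n)               # 0 = dropped category (no high school)
edu = np.column_stack([level == 1, level == 2, level == 3]).astype(float)

# Simulated switching indicator: by construction education has no effect.
d = (rng.random(n) < 0.2 + 0.1 * z).astype(float)

const = np.ones(n)
X_u = np.column_stack([const, z, edu])      # with the three education dummies
X_r = np.column_stack([const, z])           # education coefficients restricted to zero

f_stat = joint_f_stat(d, X_u, X_r)
print(f"F statistic with 3 and {n - X_u.shape[1]} degrees of freedom: {f_stat:.3f}")
```

The statistic would be compared with a critical value from the F distribution with 3 and n - k degrees of freedom; a large value indicates that the education variables are jointly informative even when none is individually significant.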

In each of these examples, data is used to test a hypothesis about individual 
behavior. In the first study (Ashraf and Galor [2]), the implications of indi- 
vidual actions on the aggregate economy (i.e., nations) are examined. Specif- 
ically, does greater diversity lead to economic growth? In the second study, 
Naidu and Yuchtman [35] reduced the level of analysis to the region, asking 
whether the Master and Servant Act affected wages at the parish (or county) 
level. In both scenarios the formulation does not model the actions themselves 
(i.e., whether genetic diversity improves the ability to carry out a variety of 
activities through a more diverse skill set or whether the presence of labor re- 
strictions limited factor mobility) but the consequences of those actions. The 
last example (Kling et al. [25]) focuses more directly on individual behavior. 
However, in all three cases an economic theory is faced with observations. 

On a somewhat related matter, econometrics positions economics as a pos- 
itive science. Econometrics is interested in what happens as opposed to what 
should happen (i.e., a positive instead of a normative science; see box The 
TABLE 1.4
Effect of Information on Comparison Friction

Variable                                    Parameter
Intervention                                0.098*
                                            (0.041)^a
Female                                      -0.023
                                            (0.045)
Married                                     0.107*
                                            (0.045)
High School Graduate                        -0.044
                                            (0.093)
College Graduate                            0.048
                                            (0.048)
Post-college Graduate                       -0.084
                                            (0.062)
Age 70+                                     -0.039
                                            (0.060)
Age 75+                                     0.079
                                            (0.048)
4+ Medications                              -0.054
                                            (0.050)
7+ Medications                              0.116*
                                            (0.052)
Poor                                        0.097*
                                            (0.045)

^a Numbers in parentheses denote standard errors. * denotes statistical
significance at the 0.10 level.

Source: Kling et al. [25]


Methodology of Positive Economics — Friedman). In the foregoing discussion
we were not interested in whether increased diversity should improve
economic growth, but rather whether it could be empirically established that 
increased diversity was associated with higher economic growth. 


The Methodology of Positive Economics — Friedman 


... the problem how to decide whether a suggested hypothesis or
theory should be tentatively accepted as part of the “body of
systematized knowledge concerning what is.” But the confusion
[John Neville] Keynes laments is still so rife and so much a
hindrance of the recognition that economics can be, and in part is,
a positive science that it seems well to preface the main body of the
paper with a few remarks about the relation between positive and
normative economics.

... Self-proclaimed “experts” speak with many voices and can
hardly all be regarded as disinterested; in any event, on questions
that matter so much, “expert” opinion could hardly be accepted
solely on faith even if the “experts” were nearly unanimous and
clearly disinterested. The conclusions of positive economics seem
to be, and are, immediately relevant to important normative prob- 
lems, to questions of what ought to be done and how any given goal 
can be attained. Laymen and experts alike are inevitably tempted 
to shape positive conclusions to fit strongly held normative pre- 
conceptions and to reject positive conclusions if their normative 
implications — or what are said to be their normative implications 
— are unpalatable. 

Positive economics is in principle independent of any partic- 
ular ethical position or normative judgments. As Keynes says, it 
deals with “what is,” not with “what ought to be.” Its task is to 
provide a system of generalizations that can be used to make cor- 
rect predictions about the consequences of any change in circum- 
stances. Its performance is to be judged by the precision, scope, 
and conformity with experience of the predictions it yields [13, pp. 
3-5]. 


1.1.2 Econometrics and Planning 


While the interaction between governments and their economies is a subject 
beyond the scope of the current book, certain features of this interaction are 
important when considering the development of econometrics and the role of 
mathematical statistics within that development. For modern students of eco- 
nomics, the history of economics starts with the classical economics of Adam 
Smith [44]. At the risk of oversimplification, Smith’s insight was that markets
allowed individuals to make choices that maximized their well-being. Aggre- 
gated over all individuals, these decisions acted like an invisible hand that 
allocated resources toward the production of goods that maximized the over- 
all well-being of the economy. This result must be viewed within the context 
of the economic thought that the classical model replaced — mercantilism [43]. 
Historically the mercantile system grew out of the cities. Each city limited the 
trade in raw materials and finished goods in its region to provide economic 
benefits to the city’s craftsmen and merchants. For example, by prohibiting 
the export of wool (or by imposing significant taxes on those exports) the re- 
sulting lower price would benefit local weavers. Smith’s treatise demonstrated 
that these limitations reduced society’s well-being. 

The laissez-faire of classical economics provided little role for econometrics 
as a policy tool. However, the onset of the Great Depression provided a
significantly greater potential role for econometrics (see box Government and
Economic Life — Staley). Starting with the Herbert Hoover administration,
the U.S. government increased its efforts to stimulate the economy. This
shift to the managed economy associated with the administration of Franklin 
Roosevelt significantly increased the use of econometrics in economic policy. 
During this period, the National Income and Product Accounts (NIPA) were 
implemented to estimate changes in aggregate income and the effect of a va- 
riety of economic policies on the aggregate economy. Hence, this time period 
represents the growth of econometrics as a planning tool which estimates the 
effect of economic policies such as changes in the level of money supply or in- 
creases in the minimum wage (see box Economic Planning — Tinbergen). 


Government and Economic Life — Staley 


The enormous increase in the economic role of the state over the 
last few years has the greatest possible importance for the future 
of international economic relationships. State economic activities 
have grown from such diverse roots as wartime needs, the fear of 
war and the race for rearmament and military self-sufficiency, the 
feelings of the man in the street on the subject of poverty in the 
midst of plenty, innumerable specific pressures from private interests,
the idea of scientific management, the philosophy of collectivist
socialism, the totalitarian philosophy of the state, the sheer
pressure of economic emergency in the depression, and the accep- 
tance of the idea that it is the state’s business not only to see that 
nobody starves but also to ensure efficient running of the economic 
machine... 

Governments have taken over industries of key importance, such 
as munition factories in France, have assumed the management 
of public utility services, as under the Central Electricity Board in 
Great Britain, and have set up public enterprises to prepare the way
for development of whole regions and to provide “yardsticks” for 
private industries, as in the case of the Tennessee Valley Authority 
in the United States [46, pp. 128-129]. 


Economic Planning — Tinbergen 


This study deals with the process of central economic planning, or 
economic planning by governments. It aims at a threefold treat- 
ment, which may be summarized as follows: (a) to describe the 
process of central planning, considered as one of the service in- 
dustries of the modern economy; (b) to analyze its impact on the 
general economic process; (c) to indicate, as far as possible, the 
optimal extent and techniques of central planning [52, p. 3].

The origin of the planning techniques applied today clearly
springs from two main sources: Russian communist planning and
Western macroplanning.... 

Western macroeconomic planning had a very different origin, 
namely the desire to understand the operation of the economy as a 
whole. It was highly influenced by the statistical concepts relevant 
to national or social accounts and by Keynesian concepts, combined 
with market analysis, which later developed into macroeconomic 
econometric models. There was still a basic belief that many de- 
tailed decisions could and should be left to the decentralized system 
of single enterprises and that guidance by the government might 
confine itself to indirect intervention with the help of a few instru- 
ments only [52, pp. 4-5]. 


While Tinbergen focuses on the role of econometrics in macroeconomic 
policy, agricultural policy has generated a variety of econometric applications. 
For example, the implementation of agricultural policies such as loan rates [42] 
results in an increase in the supply of crops such as corn and wheat. Econo- 
metric techniques are then used to estimate the effect of these programs on 
government expenditures (i.e., loan deficiency payments). The passage of the 
Energy Independence and Security Act of 2007 encouraged the production of 
biofuels by requiring that 15 billion gallons of ethanol be added to the gaso- 
line consumed in the United States. This requirement resulted in corn prices 
significantly above the traditional loan rate for corn. The effect of ethanol on 
corn prices increased significantly with the drought in the U.S. Midwest in 
2012. The combination of the drought and the ethanol requirement caused 
corn prices to soar, contributing to a significant increase in food prices. This 
interaction has spawned numerous debates, including pressure to reduce the 
ethanol requirements in 2014 by as much as 3 billion gallons. At each step of 
this policy debate, various econometric analyses have attempted to estimate 
the effect of policies on agricultural and consumer prices as well as government 
expenditures. In each case, these econometric applications were not intended 
to test economic theory, but to provide useful information to the policy pro- 
cess. 


1.2 Mathematical Statistics and Modeling Economic Decisions


Apart from the use of statistical tools for inference, mathematical statistics
also provides several concepts useful in the analysis of economic decisions
under risk and uncertainty. Moss [32] demonstrates how probability theory
contributes to the derivation of the Expected Utility Hypothesis. Beyond
their role in the development of theory, these tools are also important for
several applied methodologies for dealing with risk and uncertainty, such as
the Capital Asset Pricing Model, Stochastic Dominance, and Option Pricing
Theory.

FIGURE 1.1
Standard Normal Density Function.

Skipping ahead a little bit, the normal distribution function depicts the 
probability density for a given outcome x as a function of the mean and 
variance of the distribution. 


$$ f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \tag{1.6} $$


Graphically, the shape of the function under the assumptions of the “standard
normal” (i.e., $\mu = 0$ and $\sigma^2 = 1$) is depicted in Figure 1.1. This
curve is sometimes referred to as the Bell Curve. Statistical inference
typically involves designing a probabilistic measure for testing a sample of
observations drawn from this data set against an alternative assumption, for
example, $\mu = 0$ versus $\mu = 2$. The difference in these distributions is
presented in Figure 1.2.
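
As a rough numerical illustration of Equation 1.6, the normal density can be
evaluated under the two means discussed above. This is only a sketch: the
function name normal_pdf is an illustrative choice rather than anything
defined in the text, and the variance is assumed to be 1 in both cases.

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Normal density at x for mean mu and variance sigma2 (hypothetical helper)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# Compare the density under the two hypothesized means at a few outcomes.
for x in (-1.0, 0.0, 1.0, 2.0, 3.0):
    print(f"x = {x:4.1f}  mean 0: {normal_pdf(x):.4f}  mean 2: {normal_pdf(x, mu=2.0):.4f}")
```

At x = 1, halfway between the two means, the two densities coincide; outcomes
below that point are more probable under a mean of 0 and outcomes above it
are more probable under a mean of 2.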

An alternative economic application involves the choice between the two
distribution functions. For example, under what conditions does a risk averse
producer prefer the alternative that produces each distribution?¹ Figure 1.3
presents the distribution functions of profit ($\pi$) for two alternative
actions that a decision maker may choose. Alternative 1 has a mean of 0 and a
variance of 1 (i.e., is standard normal) while the second distribution has a
mean of 0.75 with a standard deviation of 1.25. There are a variety of ways to
compare these two alternatives; one is first degree stochastic dominance,
which basically asks whether one alternative always has a higher probability
of producing a higher return [32, pp. 150-152]. First degree stochastic
dominance involves comparing the cumulative distribution function (i.e., the
probability the random variable will be less than or equal to any value) for
each alternative. As presented in Figure 1.4, Alternative 2 dominates (i.e.,
provides a higher return for the relative risk) Alternative 1.

¹ The optimizing behavior for risk averse producers typically involves a
choice between combinations of expected return and risk. Under normality the
most common measure of risk is the variance. In the scenario where the
expected return (or mean) is the same, decision makers prefer the alternative
that produces the lowest risk (or variance).

FIGURE 1.2
Normal Distributions with Different Means.

FIGURE 1.3
Alternative Normal Distributions.

FIGURE 1.4
First Degree Stochastic Dominance: A Comparison of Cumulative Distribution
Functions.
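
The comparison of cumulative distribution functions described above can be
sketched numerically. The code below is an illustration under stated
assumptions, not the procedure used to draw Figure 1.4: the helper normal_cdf
and the grid of profit levels from -6 to 6 are hypothetical choices. One
point worth noting is that, because the two alternatives have different
variances, their normal cumulative distribution functions cross at a profit
of -3, so the ranking holds above that point (which excludes only roughly 0.1
percent of the probability mass of Alternative 1) rather than everywhere.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P[X <= x] for a normal variable with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Alternative 1: mean 0, std. dev. 1. Alternative 2: mean 0.75, std. dev. 1.25.
grid = [i / 10.0 for i in range(-60, 61)]          # assumed profit levels, -6 to 6
diff = [normal_cdf(x) - normal_cdf(x, 0.75, 1.25) for x in grid]

# A lower CDF at x means a smaller chance of earning x or less.
favors_2 = [x for x, d in zip(grid, diff) if d > 0]
favors_1 = [x for x, d in zip(grid, diff) if d < 0]
print("Alternative 2 has the lower CDF from", min(favors_2), "to", max(favors_2))
print("Alternative 1 has the lower CDF from", min(favors_1), "to", max(favors_1))
```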


1.3 Chapter Summary


• Mathematical statistics involves the rigorous development of statistical
  reasoning.

  - The goal of this textbook is to make the student think about statistics
    as more than a toolbox of techniques.

  - These mathematical statistics concepts form the basis of econometric
    formulations.

• Our analysis of mathematical statistics raises questions regarding our
  definition of economics as a science versus economics as a tool for
  decision makers.

  - Following Popper, a science provides for the empirical testing of a
    conjecture.

  - Popper does not classify the process of developing conjectures as a
    science, but the scientific method allows for experimental (or
    experiential) testing of its precepts.

  - This chapter reviews some examples of empirical tests of economic
    hypotheses. However, it also points to cases where simple tests of
    rather cherished economic theories provide dubious results.

• In addition to using econometric/statistical concepts to test theory, these
  procedures are used to inform economic decisions.

  - Econometric analysis can be used to inform policy decisions. For
    example, we may be interested in the gains and losses from raising the
    minimum wage. However, econometric analysis may indicate that this
    direct effect will be partially offset by increased spending from those
    benefiting from the increase in the minimum wage.

  - In addition, mathematical statistics helps us model certain decisions
    such as producer behavior under risk and uncertainty.


1.4 Review Questions 


1-1R. What are the two primary uses of econometrics? 


1-2R. Review a recent issue of the American Economic Review and discuss 
whether the empirical applications test economic theory or provide es- 
timates useful for policy analysis. 


1-3R. Review a recent issue of the American Economic Journal: Economic 
Policy and discuss whether the empirical applications test economic 
theory or provide estimates useful for policy analysis. 


1-4R. Discuss a scenario where mathematical statistics informs economic the- 
ory in addition to providing a means of scientific testing. 


Part I 


Defining Random Variables

A primary building block of statistical inference is the concept of probability.
However, this concept has actually been the subject of significant debate over 
time. Hacking [16] provides a detailed discussion of the evolution of the concept 
of probability: 


Probability has two aspects. It is connected with the degree of belief 
warranted by evidence, and it is connected with the tendency, displayed 
by some chance devices, to produce stable relative frequencies [16, p. 1]. 


As developed by Hacking, the concept of probability has been around for thou- 
sands of years in the context of games of chance. This concept of probability 
follows the idea of the tendency of chance displayed by some device — dice or 
cards. The emergence of a formal concept of probability can be found in corre- 
spondence between the mathematical geniuses Pierre de Fermat (1601-1665) 
and Blaise Pascal (1623-1662). 

The correspondence between Fermat and Pascal attempts to develop a rule 
for the division of an incomplete game. The incomplete game occurs when a 
multiple stage contest is interrupted (i.e., the game is completed when one 
individual rolls a certain value eight times but the play is interrupted after 
seven rolls). 


Let us say that I undertake to win a point with a single die in eight 
rolls; if we agree, after the money is already in the game, that I will 
not make my first roll, then according to my principle it is necessary
that I take out 1/6 of the pot to be bought out, since I will be unable
to roll to win in the first round (Letter of Fermat to Pascal [9]). 


The first textbook on probability following this gaming approach was then 
produced by Christiaan Huygens [20] in 1657. 

Typically, econometrics based its concept of probability loosely on this 
discrete gaming approach. As described by Tinbergen, 


A clear example which tells us more than abstract definitions is the 
number of times we can throw heads with one or more coins. If we 
throw with one coin, that number can be 0 or 1 on each throw; if we 
throw with three coins, it can be 0, 1, 2, or 3. The number of times each 
of these values appears is its frequency; the table of these frequencies 
is the frequency distribution. If the latter is expressed relatively, i.e., in
figures the sum of the total of which is 1 and which are proportionate to 
the frequencies, we speak of probability distribution. The probability 
or relative frequency of the appearance of one certain result indicates 
which part of the observations leads to that outcome [51, pp. 60-61]. 


This pragmatic approach differs from more rigorous developments offered by 
Lawrence Klein [24, pp. 23-28]. 
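
Tinbergen's coin example can be reproduced by enumeration. The sketch below
assumes fair coins, so that every sequence of heads and tails is equally
likely; the function name heads_distribution is an illustrative choice. It
tabulates the frequency of each possible number of heads and then expresses
those frequencies relatively, so that they sum to one.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def heads_distribution(n_coins):
    """Frequency and relative frequency of the number of heads in n_coins tosses."""
    outcomes = list(product("HT", repeat=n_coins))   # every equally likely sequence
    frequency = Counter(seq.count("H") for seq in outcomes)
    total = len(outcomes)
    return {k: (frequency[k], Fraction(frequency[k], total)) for k in sorted(frequency)}

# Three coins: the number of heads can be 0, 1, 2, or 3.
for heads, (freq, prob) in heads_distribution(3).items():
    print(heads, freq, prob)
```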

Skipping some of the minutiae of the development of probability, there are 
three approaches. 


1. The frequency approach — following Huygens, probability is simply de- 
fined by the relative count of outcomes. 


2. Personal probabilities where individuals anticipate the likelihood of 
outcomes based on personal information. This approach is similar to the 
concept of utility developed in consumer theory. The formulation is pri- 
marily based on Leonard Savage [41] and Bruno de Finetti [10]. 


3. Axiomatic probabilities where the properties of a probability function 
are derived from basic conjectures. The axiomatic approach is typically 
attributed to Andrei Kolmogorov [39, pp. 137-143]. 


The frequency approach has the advantage of intuition. A variety of board 
games involve rolling dice, from Monopoly to Risk™. The growth in pop- 
ularity of state lotteries from the late 1980s through the early 2000s to fund 
various government programs has increased the popular knowledge of drawing 
without replacement. Finally, the growth of the casino industries in states such 
as Mississippi and on Indian reservations has extended participation in games 
of chance beyond the traditional bastions of Las Vegas and Reno, Nevada, 
and Atlantic City, New Jersey. 

However, some central concepts of probability theory are difficult for a 
simple frequency motivation. For example, the frequency approach is intuitively
appealing for discrete outcomes (i.e., the number of points depicted on
the top of a die or whether a coin lands heads or tails up). The intuition of
the frequentist vanishes when the variable of interest is continuous, such as 
the annual rainfall or temperature in a crop region. In these scenarios, the 
probability of any particular outcome is numerically zero. Further, there are 
some outcomes that are actually discrete but the possible number of outcomes 
is large enough that the frequency of outcomes approaches zero. For example, 
the observed Dow Jones Industrial Average reported in the news is a discrete 
number (value in hundredths of a dollar or cents). Similarly, for years stock 
prices were reported in sixteenths of a dollar.

As one example, consider the rainfall in Sayre, Oklahoma, for months im- 
portant for the production of hard red winter wheat (Table 2.1). Typically 
hard red winter wheat is planted in western Oklahoma in August or Septem- 
ber, so the expected rainfall in this period is important for the crop to sprout. 
Notice that while we may think about rainfall as a continuous variable, the
data presented in Table 2.1 is discrete (i.e., there is a countable number of
hundredths of an inch of rainfall). In addition, we will develop two different
ways of envisioning this data. From an observed sample framework we will
look at the outcomes in Table 2.1 as equally likely, like the outcomes of the
roll of a die. From this standpoint, we can create the empirical distribution
function (i.e., the table of outcomes and probabilities of those outcomes that
we will develop more fully in Chapter 3) for rainfall in August and September
presented in Table 2.2. An interesting outcome of the empirical distribution
function is that the outcome of 7.26 inches of rainfall is twice as likely as any
other outcome. Looking ahead at more empirical approaches to probability,
suppose that we constructed a histogram by counting the number of years in
which the rainfall fell within each one-inch interval (i.e., in three years the
rainfall was between 1 and 2 inches). This data is presented in Table 2.3.
Note that the table excludes the rainfall of 17.05 inches in the 95/96 crop
year; in the terms of probability, I am tentatively classifying this outcome
as an outlier. This table shows us that different rainfall amounts are really
not equally likely. For example, rainfall between 5 and 6 inches is
empirically more likely than rainfall between 3 and 4 inches (i.e., 0.170 is
greater than 0.094). This general approach of constructing probability
functions can be thought of as a frequentist application to continuous data.
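
The fractions reported in Table 2.3 are simply bin counts divided by the
sample size. The sketch below takes the counts as printed in that table; the
denominator of 53 observations is an assumption inferred here (the 52
tabulated years plus the excluded outlier), chosen because it reproduces the
printed fractions.

```python
# Bin counts as printed in Table 2.3 (lower bin edge in inches -> count).
counts = {0: 0, 1: 3, 2: 3, 3: 5, 4: 7, 5: 9, 6: 4, 7: 7,
          8: 6, 9: 1, 10: 1, 11: 2, 12: 2, 13: 1, 14: 1}

# Assumption: 53 observations in total, i.e., the 52 tabulated here plus the
# excluded 17.05-inch outlier.
n_obs = sum(counts.values()) + 1

shares = {lo: round(c / n_obs, 3) for lo, c in counts.items()}
for lo, share in shares.items():
    print(f"{lo:2d}-{lo + 1:<2d}  {counts[lo]:2d}  {share:.3f}")
```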

In addition to the practical problem of continuous distributions, the fre- 
quentist approach did not provide a rigorous concept of the properties of 
probability: 


By the end of the seventeenth century the mathematics of many simple
(and some not-so-simple) games of chance was well understood and
widely known. Fermat, Pascal, Huygens, Leibniz, Jacob Bernoulli, and
Arbuthnot all examined the ways in which the mathematics of
permutations and combinations could be employed in the enumeration of
favorable cases in a variety of games of known properties. But this
early work did not extend to the consideration of the problem: How,
from the outcome of a game (or several outcomes of the same game),
could one learn about the properties of the game and how could one
quantify the uncertainty of our inferred knowledge of the properties?
[47, p. 63]


TABLE 2.2
Empirical Probability Distribution of Rainfall

Rainfall  Count    Rainfall  Count    Rainfall  Count    Rainfall  Count
  0.00      1        3.26      1        4.77      1        7.68      1
  0.76      1        3.34      1        5.41      1        7.79      1
  0.99      1        3.37      1        5.72      1        7.91      1
  1.32      1        3.78      1        5.89      1        7.92      1
  1.59      1        3.90      1        5.98      1        8.09      1
  1.63      1        4.11      1        6.12      1        9.81      1
  2.29      1        4.19      1        6.26      1       10.15      1
  2.30      1        4.25      1        6.36      1       10.69      1
  2.46      1        4.36      1        6.41      1       11.02      1
  2.79      1        4.38      1        6.46      1       11.52      1
  2.98      1        4.42      1        6.62      1       12.34      1
  3.09      1        4.44      1        6.67      1       13.17      1
  3.10      1        4.72      1        7.28      2       17.05      1


TABLE 2.3
Histogram of Rainfall in August and September in Sayre, Oklahoma

Rainfall (inches)   Count   Fraction of Sample
  0-1                 0       0.000
  1-2                 3       0.057
  2-3                 3       0.057
  3-4                 5       0.094
  4-5                 7       0.132
  5-6                 9       0.170
  6-7                 4       0.075
  7-8                 7       0.132
  8-9                 6       0.113
  9-10                1       0.019
 10-11                1       0.019
 11-12                2       0.038
 12-13                2       0.038
 13-14                1       0.019
 14-15                1       0.019


Intuitively, suppose that you were playing a board game such as backgammon 
and your opponent rolls doubles 6 out of 16 rolls, advancing around the board 
and winning the game. Could you conclude that the dice were indeed fair?
What is needed is a systematic formulation of how probability works. For
example, Zellner [54] defines the direct probability as a probability model where
the form and nature of the probability structure are completely known (i.e.,
the probability of rolling a die). Given this formulation, the only thing that is 
unknown is the outcome of a particular roll of the dice. He contrasts this model 
with the inverse probability, where we observe the outcomes and attempt to 
say something about the probability generating process (see the Zellner on 
Probability box). 
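
The backgammon question can be framed as a direct probability calculation: if
the dice are fair, how likely are six or more doubles in 16 rolls? The sketch
below assumes independent rolls of two fair six-sided dice, so the number of
doubles follows a binomial distribution with success probability 6/36; the
function name binomial_tail is an illustrative choice. Under these
assumptions the probability is on the order of a few percent, which is
suggestive but does not by itself settle whether the dice are fair.

```python
from math import comb

def binomial_tail(n, k, p):
    """P[X >= k] when X counts successes in n independent trials with probability p."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k, n + 1))

# Six of the 36 equally likely outcomes of two fair dice are doubles.
p_doubles = 6 / 36
print(binomial_tail(16, 6, p_doubles))
```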


Zellner on Probability 


On the other hand, problems usually encountered in science are 
not those of direct probability but those of inverse probability. That 
is, we usually observe data which are assumed to be the outcome 
or output of some probability process or model, the properties of 
which are not completely known. The scientist’s problem is to infer 
or learn the properties of the probability model from observed data, 
a problem in the realm of inverse probability. For example, we may 
have data on individuals’ incomes and wish to determine whether
they can be considered as drawn or generated from a normal proba- 
bility distribution or by some other probability distribution. Ques- 
tions like these involve considering alternative probability models 
and using observed data to try to determine from which hypothe- 
sized probability model the data probably came, a problem in the 
area of statistical analysis of hypotheses. Further, for any of the 
probability models considered, there is the problem of using data 
to determine or estimate the values of parameters appearing in it, 
a problem of statistical estimation. Finally, the problem of using 
probability models to make predictions about as yet unobserved 
data arises, a problem of statistical prediction [54, p. 69]. 


Three Views of the Interpretation of Probability 


Objectivistic: views hold that some repetitive events, such 
as tosses of a penny, prove to be in reasonably close agreement 
with the mathematical concept of independently repeated random 
events, all with the same probability. According to such views, ev- 
idence for the quality of agreement between the behavior of the 
repetitive event and the mathematical concept, and for the mag- 
nitude of the probability that applies (in case any does), is to be 
obtained by observation of some repetitions of the event, and from 
no other source whatsoever. 

Personalistic: views hold that probability measures the 
confidence that a particular individual has in the truth of a particular
proposition, for example, the proposition that it will rain
tomorrow. These views postulate that the individual concerned is
in some ways “reasonable,” but they do not deny the possibility 
that two reasonable individuals faced with the same evidence may 
have different degrees of confidence in the truth of the same propo- 
sition. 

Necessary: views hold that probability measures the ex- 
tent to which one set of propositions, out of logical necessity and 
apart from human opinion, confirms the truth of another. They are 
generally regarded by their holders as extensions of logic, which 
tells when one set of propositions necessitates the truth of another 
[41, p. 3].


2.1 Two Definitions of Probability for Econometrics 


To begin our discussion, consider two fairly basic definitions of probability. 


• Bayesian: probability expresses the degree of belief a person has about
  an event or statement by a number between zero and one.


• Classical: the relative number of times that an event will occur as the
  number of experiments becomes very large.

$$ \lim_{N \to \infty} P[O] = \frac{r}{N} \tag{2.1} $$

where $r$ is the number of times the event $O$ occurs in $N$ experiments.


The Bayesian concept of probability is consistent with the notion of a
personalistic probability advanced by Savage and de Finetti, while the classical
probability follows the notion of an objective or frequency probability. 
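
The classical definition can be illustrated by simulation: as the number of
experiments grows, the relative frequency of an event settles near its
probability. The sketch below is an illustration only; it uses the event "a
fair die shows an even face" (probability 1/2), and the function name
relative_frequency and the fixed seed are arbitrary choices made so the run
is repeatable.

```python
import random

def relative_frequency(n_rolls, seed=0):
    """Share of even faces in n_rolls simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)   # fixed seed: an arbitrary choice for repeatability
    evens = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) % 2 == 0)
    return evens / n_rolls

# The relative frequency drifts toward 0.5 as the number of rolls increases.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```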

Intuitively, the basic concept of probability is linked to the notion of a 
random variable. Essentially, if a variable is deterministic, its probability is 
either one or zero: the result either happens or it does not (i.e., if
$x = f(z) = z^2$, the probability that $x = f(2) = 4$ is one, while the
probability that $x = f(2) = 5$ is zero). The outcome of a random variable is
not certain. If $x$ is a random variable it can take on different values.
While we know the possible values that the variable takes on, we do not know
the exact outcome before the event. For example, we know that flipping a coin
could yield two outcomes: a head or a tail. However, we do not know what the
value will be before
we flip the coin. Hence, the outcome of the flip — head or tail — is a random 
variable. In order to more fully develop our notion of random variables, we 
have to refine our discussion to two general types of random variables: discrete 
random variables and continuous random variables.

A discrete random variable is some outcome that can only take on a
fixed number of values. The number of dots on a die is a classic example of a 
discrete random variable. A more abstract random variable is the number of 
red rice grains in a given measure of rice. It is obvious that if the measure is 
small, this is little different from the number of dots on the die. However, if 
the measure of rice becomes large (a barge load of rice), the discrete outcome 
becomes a countable infinity, but the random variable is still discrete in a 
classical sense. 

A continuous random variable represents an outcome that cannot be 
technically counted. Amemiya [1] uses the height of an individual as an ex- 
ample of a continuous random variable. This assumes an infinite precision 
of measurement. The normally distributed random variable presented in Fig- 
ures 1.1 and 1.3 is an example of a continuous random variable. In our forego- 
ing discussion of the rainfall in Sayre, Oklahoma, we conceptualized rainfall 
as a continuous variable while our measure was discrete (i.e., measured in a 
finite number of hundredths of an inch).

The exact difference between the two types of random variables has an 
effect on notions of probability. The standard notions of Bayesian or Classical 
probability fit the discrete case well. We would anticipate a probability of 
1/6 for any face of the die. In the continuous scenario, the probability of any 
specific outcome is zero. However, the probability density function yields 
a measure of relative probability. The concepts of discrete and continuous 
random variables are then unified under the broader concept of a probability 
density function. 


2.1.1 Counting Techniques 


A simple method of assigning probability is to count how many ways an event 
can occur and assign an equal probability to each outcome. This methodology 
is characteristic of the early work on objective probability by Pascal, Fermat, 
and Huygens. Suppose we are interested in the probability that a die roll will 
be even. The set of all even outcomes is A = {2, 4, 6}. The number of even
outcomes is n(A) = 3. The set of all possible die rolls is
S = {1, 2, 3, 4, 5, 6}, so n(S) = 6. The probability of these countable events
can then be expressed as

$$ P[A] = \frac{n(A)}{n(S)} \tag{2.2} $$

where the probability of event A is simply the number of possible occurrences
of A divided by the number of possible occurrences in the sample, or in this
example P[even die rolls] = 3/6 = 0.50.
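
The counting rule can be written directly in terms of sets. The sketch below
assumes every outcome in the sample space is equally likely; the names
sample_space, event_even, and count_probability are illustrative.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}                      # S: all faces of the die
event_even = {s for s in sample_space if s % 2 == 0}   # A: the even faces

def count_probability(event, space):
    """P[A] = n(A) / n(S) when every outcome in the sample space is equally likely."""
    return Fraction(len(event & space), len(space))

print(count_probability(event_even, sample_space))
```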


Definition 2.1. The number of permutations of taking r elements out of n
elements is the number of distinct ordered sets consisting of r distinct elements
which can be formed out of a set of n distinct elements and is denoted $P_r^n$.

The first point to consider is that of factorials. For example, if you have
two objects A and B, how many different ways are there to order the objects?
Two: 

{A, B} or {B, A}. (2.3) 


If you have three objects, how many ways are there to order the objects? Six: 


{A, B, C}, {A, C, B}, {B, A, C}, {B, C, A}, {C, A, B}, or {C, B, A}. (2.4)

The sequence then becomes: two objects can be drawn in two sequences,
three objects can be drawn in six sequences (2 × 3). By induction, four
objects can be drawn in 24 sequences (6 × 4).

The total possible number of sequences for n objects is then n!, defined as:


$$ n! = n(n-1)(n-2)\cdots 1. \tag{2.5} $$
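
The factorial count can be checked by listing the orderings explicitly. The
sketch below uses the standard library to enumerate every ordering of two,
three, and four labeled objects and compares the number of orderings with n!;
the object labels are illustrative.

```python
from itertools import permutations
from math import factorial

for objects in ("AB", "ABC", "ABCD"):
    orderings = list(permutations(objects))   # every distinct ordering of the objects
    assert len(orderings) == factorial(len(objects))
    print(len(objects), "objects:", len(orderings), "orderings")
```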


Theorem 2.2. The (partial) permutation value can be computed as


$$ P_r^n = \frac{n!}{(n-r)!}. \tag{2.6} $$


The term partial permutation is sometimes used to denote the fact that we
are not completely drawing the sample (i.e., r < n). For example, consider
the simple case of drawing one out of two possibilities:

$$ P_1^2 = \frac{2!}{(2-1)!} = 2 \tag{2.7} $$

which yields the intuitive result that there are two possible values of drawing
one from two (i.e., either A or B). If we increase the number of possible
outcomes to three, we have

$$ P_1^3 = \frac{3!}{(3-1)!} = \frac{6}{2} = 3 \tag{2.8} $$

which yields a similarly intuitive result that we can now draw three possible
first values A, B, or C. Taking the case of three possible outcomes one step
further, suppose that we draw two numbers from three possibilities:

$$ P_2^3 = \frac{3!}{(3-2)!} = \frac{6}{1} = 6. \tag{2.9} $$

Table 2.4 presents these results. Note that in this formulation order matters.
Hence {A, B} ≠ {B, A}.
TABLE 2.4
Partial Permutation of Three Values

Low First    High First
{A, B}       {B, A}
{A, C}       {C, A}
{B, C}       {C, B}


TABLE 2.5
Permutations of Four Values

     A First         B First         C First         D First
1    {A, B, C, D}    {B, A, C, D}    {C, A, B, D}    {D, A, C, B}
2    {A, B, D, C}    {B, A, D, C}    {C, A, D, B}    {D, A, B, C}
3    {A, C, B, D}    {B, C, A, D}    {C, B, A, D}    {D, B, A, C}
4    {A, C, D, B}    {B, C, D, A}    {C, B, D, A}    {D, B, C, A}
5    {A, D, B, C}    {B, D, A, C}    {C, D, A, B}    {D, C, A, B}
6    {A, D, C, B}    {B, D, C, A}    {C, D, B, A}    {D, C, B, A}


To develop the generality of these formulas, consider the number of
permutations for completely drawing four possible numbers (i.e., 4! = 24 possible
sequences, as depicted in Table 2.5). How many ways are there to draw the
first number?


$$P_1^4 = \frac{4!}{(4-1)!} = \frac{24}{6} = 4. \tag{2.10}$$

The result seems obvious: if there are four different numbers, then there are
four different numbers you could draw on the first draw (i.e., see the four
columns of Table 2.5). Next, how many ways are there to draw two numbers
out of four?

$$P_2^4 = \frac{4!}{(4-2)!} = \frac{24}{2} = 12. \tag{2.11}$$

To confirm the result in Equation 2.11, note that Table 2.5 is grouped
by the first two numbers drawn. Hence, we see that there are three
unique ordered pairs where A is first (i.e., {A, B}, {A, C}, and {A, D}). Given
that the same is true for each column, there are 4 x 3 = 12 ordered pairs.
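
The counting argument for ordered draws can be checked by brute force. The following is a minimal sketch in Python (an illustration, not part of the original text; the helper name `partial_permutations` is hypothetical) that enumerates ordered pairs with `itertools.permutations` and compares the count with $n!/(n-r)!$.

```python
from itertools import permutations
from math import factorial


def partial_permutations(n: int, r: int) -> int:
    """Number of ordered draws of r distinct elements out of n: n! / (n - r)!."""
    return factorial(n) // factorial(n - r)


values = ["A", "B", "C", "D"]

# Enumerate every ordered pair, so ("A", "B") and ("B", "A") are both listed.
ordered_pairs = list(permutations(values, 2))

print(len(ordered_pairs), partial_permutations(4, 2))
```

Both numbers printed should agree, since the enumeration and the formula count the same ordered pairs.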

Next, consider the scenario where we don’t care which number is drawn
first, so that {A, B} = {B, A}. This reduces the total number of outcomes
presented in Table 2.4 to three. Mathematically we could say that the number of
outcomes K could be computed as

$$K = \frac{P_2^3}{2!} = \frac{3!}{1! \times 2!} = 3. \tag{2.12}$$


Extending this result to the case of four different values, consider how many
different outcomes there are for drawing two numbers out of four if we don’t
care about the order. From Table 2.5 we can have six (i.e., {A, B} = {B, A},
{A, C} = {C, A}, {A, D} = {D, A}, {B, C} = {C, B}, {B, D} = {D, B},
and {C, D} = {D, C}). Again, we can define this figure mathematically as

$$K = \frac{P_2^4}{2!} = \frac{4!}{(4-2)!\,2!} = 6. \tag{2.13}$$


This formulation is known as a combinatorial. A more general form of the 
formulation is given in Definition 2.3. 


Definition 2.3. The number of combinations of taking r elements from n
elements is the number of distinct sets consisting of r distinct elements which
can be formed out of a set of n distinct elements and is denoted $C_r^n$.

$$C_r^n = \binom{n}{r} = \frac{n!}{(n-r)!\,r!}. \tag{2.14}$$

Apart from their application in probability, combinatorials are useful for
binomial arithmetic.

$$(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^{k}. \tag{2.15}$$

Taking a simple example, consider $(a+b)^3$.

$$(a+b)^3 = \binom{3}{0} a^{3} b^{0} + \binom{3}{1} a^{2} b^{1} + \binom{3}{2} a^{1} b^{2} + \binom{3}{3} a^{0} b^{3}. \tag{2.16}$$

Working through the combinatorials, Equation 2.16 yields

$$(a+b)^3 = a^3 + 3a^2 b + 3ab^2 + b^3 \tag{2.17}$$


which can also be derived using Pascal’s triangle, which will be discussed in
Chapter 5. As a direct consequence of this formulation, combinatorials
allow for the extension of the Bernoulli probability form to the more general
binomial distribution.
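
As a rough numerical check of the combinatorial and the binomial expansion, the sketch below uses Python's standard `math.comb` and `itertools.combinations`. It is an illustration only; the helper names `binomial_coefficients` and `binomial_expand` are hypothetical.

```python
from itertools import combinations
from math import comb

values = ["A", "B", "C", "D"]

# Unordered pairs: {A, B} and {B, A} are counted once.
unordered_pairs = list(combinations(values, 2))


def binomial_coefficients(n: int) -> list[int]:
    """Coefficients of (a + b)^n, i.e. the n-th row of Pascal's triangle."""
    return [comb(n, k) for k in range(n + 1)]


def binomial_expand(a: float, b: float, n: int) -> float:
    """Evaluate (a + b)^n term by term from the binomial sum."""
    return sum(comb(n, k) * a ** (n - k) * b ** k for k in range(n + 1))


print(len(unordered_pairs), comb(4, 2))
print(binomial_coefficients(3))
print(binomial_expand(2, 5, 3), (2 + 5) ** 3)
```

The last line compares the term-by-term sum with the direct power, which should agree for any a, b, and non-negative integer n.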

To develop this more general formulation, consider the example from 
Bierens [5, Chap. 1]; assume we are interested in the game Texas lotto. In 
this game, players choose a set of 6 numbers out of the first 50. Note that the 
ordering does not count so that 35, 20, 15, 1,5, 45 is the same as 35, 5, 15, 
20, 1, 45. How many different sets of numbers can be drawn? First, we note 
that we could draw any one of 50 numbers in the first draw. However, for the 
second draw we can only draw 49 possible numbers (one of the numbers has 
been eliminated). Thus, there are 50 x 49 different ways to draw two numbers. 
Again, for the third draw, we only have 48 possible numbers left. Therefore, 
the total number of possible ways to choose 6 numbers out of 50 is 


$$\prod_{j=0}^{5} (50 - j) = \prod_{k=45}^{50} k = \frac{\prod_{k=1}^{50} k}{\prod_{k=1}^{50-6} k} = \frac{50!}{(50-6)!}. \tag{2.18}$$

Finally, note that there are 6! ways to draw a set of 6 numbers (you could
draw 35 first, or 20 first, ...). Thus, the total number of ways to draw an
unordered set of 6 numbers out of 50 is

$$\binom{50}{6} = \frac{50!}{6!\,(50-6)!} = 15{,}890{,}700. \tag{2.19}$$
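
The lotto count can be reproduced in two steps that mirror the argument: first count ordered draws, then divide by the number of orderings of each set. This is a small sketch using Python's standard `math.perm`, `math.factorial`, and `math.comb`; the variable names are illustrative.

```python
from math import comb, factorial, perm

# Ordered draws: 50 x 49 x 48 x 47 x 46 x 45 = 50! / (50 - 6)!
ordered_draws = perm(50, 6)

# Each unordered set of 6 numbers can be drawn in 6! different orders.
orderings_per_set = factorial(6)

unordered_sets = ordered_draws // orderings_per_set

print(ordered_draws, orderings_per_set, unordered_sets)
print(unordered_sets == comb(50, 6))
```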

This description of lotteries allows for the introduction of several defini- 
tions important to probability theory. 


Definition 2.4. Sample space The set of all possible outcomes. In the Texas 
lotto scenario, the sample space is all possible 15,890,700 sets of 6 numbers 
which could be drawn. 


Definition 2.5. Event A subset of the sample space. In the Texas lotto
scenario, possible events include single draws such as {35, 20, 15, 1, 5, 45} or
complex draws such as all possible lotto tickets including {35, 20, 15}. Note
that this could be {35, 20, 15, 1, 2, 3}, {35, 20, 15, 1, 2, 4}, ....


Definition 2.6. Simple event An event which cannot be expressed as a union 
of other events. In the Texas lotto scenario, this is a single draw such as 
{35, 20, 15,1,5,45}. 


Definition 2.7. Composite event An event which is not a simple event. 


Formal development of probability requires these definitions. The sample space
specifies the possible outcomes for any random variable. In the roll of a die
the sample space is {1, 2, 3, 4, 5, 6}. In the case of a normal random variable,
the sample space is the set of all real numbers $x \in (-\infty, \infty)$. An event in the
roll of two dice could be the set of rolls whose values add up to 4:
{1, 3}, {2, 2}, {3, 1}. A simple event could be a single roll of the two
dice, such as {1, 3}.
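
To make the two-dice example concrete, the sketch below lists the sample space and the event "values add up to 4". It assumes fair dice, so that every ordered outcome is equally likely; that assumption is mine for the illustration, and the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes (first die, second die).
sample_space = list(product(range(1, 7), repeat=2))

# Composite event: the two values add up to 4.
event_sum_4 = [roll for roll in sample_space if sum(roll) == 4]

# Under the equal-likelihood assumption, probability is a ratio of counts.
prob_sum_4 = Fraction(len(event_sum_4), len(sample_space))

print(event_sum_4)
print(prob_sum_4)
```

Each element of `event_sum_4` is a simple event, while the list as a whole is a composite event.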


2.1.2. Axiomatic Foundations 


In our gaming example, the most basic concept is that each outcome $s_i =
1, 2, 3, 4, 5, 6$ is equally likely in the case of the six-sided die. Hence, the
probability of each of the events is $P[s_i] = 1/6$. That is, if the die is equally
weighted, we expect that each side is equally likely. Similarly, we assume that 
a coin landing heads or tails is equally likely. The question then arises as to 
whether our framework is restricted to this equally likely mechanism. Sup- 
pose we are interested in whether it is going to rain tomorrow. At one level, 
we could say that there are two events — it could rain tomorrow or not. Are we 
bound to the concept that these events are equally likely and simply assume 
that each event has a probability of 1/2? Such a probability structure would 
not make a very good forecast model. 

TABLE 2.6
Outcomes of a Simple Random Variable

                   Sample
Draw           1       2       3
  1            1       0       0
  2            0       0       1
  3            1       0       1
  4            0       0       1
  5            1       0       0
  6            0       1       0
  7            1       1       1
  8            1       0       1
  9            1       1       0
 10            1       0       1
 11            0       0       1
 12            0       0       1
 13            0       1       0
 14            1       1       1
 15            0       1       0
 16            1       1       1
 17            1       0       0
 18            1       1       1
 19            1       1       1
 20            1       1       1
 21            1       0       1
 22            0       0       0
 23            1       1       1
 24            1       1       0
 25            1       1       0
 26            1       1       1
 27            1       1       0
 28            0       1       1
 29            1       1       1
 30            1       0       1

Total         21      17      19

Proportion  0.700   0.567   0.633


The question is whether there is a better way to model the probability
of raining tomorrow. The answer is yes. Suppose that, in a given month over
the past thirty years, it rained on five days. We could conceptualize a game
of chance, putting five black marbles and twenty-five white marbles into a
bag. Drawing from the bag with replacement (putting the marble back each
time) could be used to represent the probability of raining tomorrow. Notice
that the chance of drawing each individual marble remains the same, 1/30, as
in the counting exercise. What differs is the relative number of marbles of
each color in the sack. It is this difference in the relative number of marbles
in the sack that yields the different probability measure.
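
The marble analogy can be sketched in a few lines of Python. The code below is an illustration under my own assumptions: black marbles stand for rain, draws are independent, and the function name `simulate_rain_share` and the seed are hypothetical choices.

```python
import random
from fractions import Fraction

# Five black marbles (rain) and twenty-five white marbles (no rain).
bag = ["black"] * 5 + ["white"] * 25

# Each marble has chance 1/30; the event "rain" collects five of them.
p_rain = Fraction(bag.count("black"), len(bag))


def simulate_rain_share(n_draws: int, seed: int = 0) -> float:
    """Share of black marbles in n_draws draws with replacement."""
    rng = random.Random(seed)
    draws = [rng.choice(bag) for _ in range(n_draws)]
    return draws.count("black") / n_draws


print(p_rain)
print(simulate_rain_share(100_000))
```

With a large number of draws the simulated share should settle near the exact value of `p_rain`, which is the frequency interpretation described in the text.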
TABLE 2.7
Probability of the Simple Random Sample

Observation    Draw    Probability
     1           1        p
     2           0        1 - p
     3           1        p
     4           0        1 - p
     5           1        p


It is the transition between these two concepts that gives rise to more 
sophisticated specifications of probability than simple counting mechanics. 
For example, consider the blend of the two preceding examples. Suppose that 
I have a random outcome that yields either a zero or a one (heads or tails). 
Suppose that I want to define the probability of a one. As a starting place, I 
could assign a probability of 1/2 — equal probability. Consider the first column 
of draws in Table 2.6. The empirical evidence from these draws yields 21 ones 
(heads) or a one occurs 0.70 of the time. Based on this draw, would you 
agree with your initial assessment of equal probability? Suppose that we draw 
another thirty observations as depicted in column 2 of Table 2.6. These results 
yield 17 heads. In this sample 57 percent of the outcomes are ones. This sample 
is closer to equally likely, but if we consider both samples we have 38 ones out 
of 60 or 63.3 percent heads. 
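
The proportions quoted above can be recomputed directly from the first two columns of Table 2.6. The snippet below is a small check; the list names `sample_1` and `sample_2` are illustrative.

```python
# Columns 1 and 2 of Table 2.6, draws 1 through 30.
sample_1 = [1, 0, 1, 0, 1, 0, 1, 1, 1, 1,
            0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
            1, 0, 1, 1, 1, 1, 1, 0, 1, 1]
sample_2 = [0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
            0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
            0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

share_1 = sum(sample_1) / len(sample_1)
share_2 = sum(sample_2) / len(sample_2)

# Pooling both samples gives 60 draws in total.
pooled = sample_1 + sample_2
share_pooled = sum(pooled) / len(pooled)

print(sum(sample_1), round(share_1, 3))
print(sum(sample_2), round(share_2, 3))
print(sum(pooled), round(share_pooled, 3))
```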

The question is how to define a set of common mechanics to compare the
two alternative views (i.e., equally versus unequally likely). The mathematical
basis is closer to the marbles in the bag than to equally probable. For example,
suppose that we define the probability of heads as p. Thus, the probability of
drawing a white ball is p while the probability of drawing a black ball is 1 - p.
The probabilities of the first five draws in Table 2.6 are then given in
Table 2.7.

As a starting point, consider the first event. To rigorously develop a notion
of the probability, we have to define the sample space. To define the sample
space, we define the possible events. In this case there are two possible events:
0 or 1 (or E = 1 or 0). The sample space defined on these events can then
be represented as S = {0, 1}. Intuitively, if we define the probability of E = 1
as p, then by definition of the sample space the probability of E = 0 is 1 - p
because one of the two events must occur. Several aspects of the last step
cannot be dismissed. For example, we assume that one and only one event
must occur: the events are exclusive (a 0 and a 1 cannot both occur) and
exhaustive (either a 0 or a 1 must occur). Thus, we denote the probability of
a 1 occurring to be p and the probability of 0 occurring to be q. If the events
are exclusive and exhaustive,

$$p + q = 1 \Rightarrow q = 1 - p \tag{2.20}$$

because one of the events must occur. In addition, to be a valid probability
we need $p \geq 0$ and $1 - p \geq 0$. This is guaranteed by $p \in [0, 1]$.

Next, consider the first two draws from Table 2.7. In this case the sample
space includes four possible events: {0, 0}, {0, 1}, {1, 0}, and {1, 1}. Typically,
we aren’t concerned with the order of the draw so {0, 1} = {1, 0}. However, we
note that there are two ways to draw this event. Thus, following the general
framework from Equation (2.20),

$$\sum_{r=0}^{2} \binom{2}{r} p^{r} q^{2-r} = p^2 + 2pq + q^2 \Rightarrow p^2 + 2p(1-p) + (1-p)^2. \tag{2.21}$$

To address the exclusive and exhaustive nature of the event space, we need to
guarantee that the probabilities sum to one (at least one event must occur).

$$p^2 + 2p(1-p) + (1-p)^2 = p^2 + 2p - 2p^2 + 1 - 2p + p^2 = 1. \tag{2.22}$$

In addition, the restriction that $p \in [0, 1]$ guarantees that each probability is
non-negative.

By induction, the probability of the sample presented in Table 2.7 is

$$P[S \mid p] = p^3 (1-p)^2 \tag{2.23}$$

for a given value of p. Note that for any value of $p \in [0, 1]$

$$\sum_{r=0}^{5} \binom{5}{r} p^{r} (1-p)^{5-r} = 1, \tag{2.24}$$

or a valid probability structure can be defined for any value p on the sample
space.
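
The two results above can be verified numerically. The following sketch assumes the draws are independent (the assumption implicit in multiplying the probabilities in Table 2.7); the function names `sample_probability` and `binomial_total` are hypothetical.

```python
from math import comb, isclose


def sample_probability(draws: list[int], p: float) -> float:
    """Probability of an ordered 0/1 sample when each 1 has probability p."""
    prob = 1.0
    for d in draws:
        prob *= p if d == 1 else 1.0 - p
    return prob


def binomial_total(n: int, p: float) -> float:
    """Sum over r = 0..n of C(n, r) p^r (1 - p)^(n - r)."""
    return sum(comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(n + 1))


table_2_7 = [1, 0, 1, 0, 1]  # the five draws in Table 2.7

for p in (0.2, 0.5, 0.7):
    print(p, sample_probability(table_2_7, p), binomial_total(5, p))
```

For every p the first number equals $p^3(1-p)^2$ and the second is one up to floating-point rounding.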

These concepts offer a transition to a more rigorous way of thinking about
probability. In fact, the distribution functions developed in Equations 2.20
through 2.24 are typically referred to as Bernoulli distributions for Jacques
Bernoulli, who offered some of the very first rigorous proofs of probability
[16, pp. 143-164]. This rigorous development is typically referred to as an
axiomatic development of probability. The starting point for this axiomatic
development is set theory.


Subset Relationships 


As described in Definitions 2.4, 2.5, 2.6 and 2.7, events or outcomes of random 
variables are defined as elements or subsets of the set of all possible outcomes. 
Hence, we take a moment to review set notation. 


(a) $A \subset B \Leftrightarrow (x \in A \Rightarrow x \in B)$.


(b) $A = B \Leftrightarrow A \subset B$ and $B \subset A$.

(c) Union: The union of A and B, written $A \cup B$, is the set of elements that
belong either to A or B.

$$A \cup B = \{x : x \in A \text{ or } x \in B\}. \tag{2.25}$$

(d) Intersection: The intersection of A and B, written $A \cap B$, is the set of
elements that belong to both A and B.

$$A \cap B = \{x : x \in A \text{ and } x \in B\}. \tag{2.26}$$

(e) Complementation: The complement of A, written $A^c$, is the set of all
elements that are not in A.

$$A^c = \{x : x \notin A\}. \tag{2.27}$$


Combining the subset notations yields Theorem 2.8. 


Theorem 2.8. For any three events A, B, and C defined on a sample space 
S, 


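a) Commutativity: $A \cup B = B \cup A$, $A \cap B = B \cap A$.

b) Associativity: $A \cup (B \cup C) = (A \cup B) \cup C$, $A \cap (B \cap C) = (A \cap B) \cap C$.

c) Distributive Laws: $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$, $A \cup (B \cap C) =
(A \cup B) \cap (A \cup C)$.

d) DeMorgan’s Laws: $(A \cup B)^c = A^c \cap B^c$, $(A \cap B)^c = A^c \cup B^c$.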
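
The identities in Theorem 2.8 can be spot-checked with Python's built-in `set` type. The sets below are arbitrary examples chosen for illustration (not from the text), and a check on one example is of course not a proof.

```python
# Hypothetical example: S plays the role of the sample space.
S = set(range(1, 11))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {4, 6, 8, 10}


def complement(event: set) -> set:
    """Complement of an event relative to the sample space S."""
    return S - event


checks = {
    "commutativity": (A | B == B | A) and (A & B == B & A),
    "associativity": (A | (B | C) == (A | B) | C)
    and (A & (B & C) == (A & B) & C),
    "distributive": (A & (B | C) == (A & B) | (A & C))
    and (A | (B & C) == (A | B) & (A | C)),
    "de_morgan": (complement(A | B) == complement(A) & complement(B))
    and (complement(A & B) == complement(A) | complement(B)),
}

for law, holds in checks.items():
    print(law, holds)
```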


Axioms of Probability 


A set $\{\omega_{j_1}, \ldots, \omega_{j_k}\}$ of different combinations of outcomes is called an event.
These events could be simple events or compound events. In the Texas lotto
case, the important aspect is that the event is something you could bet on (for
example, you could bet on three numbers in the draw 35, 20, 15). A collection
of events $\mathcal{F}$ is called a family of subsets of sample space $\Omega$. This family consists
of all possible subsets of $\Omega$ including $\Omega$ itself and the null set $\emptyset$. Following the
betting line, you could bet on all possible numbers (covering the board) so
that $\Omega$ is a valid bet. Alternatively, you could bet on nothing, or $\emptyset$ is a valid
bet.

Next, we will examine a variety of closure conditions. These are conditions
that guarantee that if one set is contained in a family, another related set
must also be contained in that family. First, we note that the family is closed
under complementarity: If $A \in \mathcal{F}$ then $A^c = \Omega \setminus A \in \mathcal{F}$. In this case
$A^c = \Omega \setminus A$ denotes all elements of $\Omega$ that are not contained in A (i.e., $A^c =
\{x : x \in \Omega, x \notin A\}$). Second, we note that the family is closed under union: If
$A, B \in \mathcal{F}$ then $A \cup B \in \mathcal{F}$.

Definition 2.9. A collection $\mathcal{F}$ of subsets of a nonempty set $\Omega$ satisfying
closure under complementarity and closure under union is called an algebra
[5].


Closure under infinite union is defined as: If $A_j \in \mathcal{F}$ for $j = 1, 2, 3, \ldots$
then $\bigcup_{j=1}^{\infty} A_j \in \mathcal{F}$.


Definition 2.10. A collection $\mathcal{F}$ of subsets of a nonempty set $\Omega$ satisfying
closure under complementarity and infinite union is called a $\sigma$-algebra
(sigma-algebra) or a Borel Field [5].
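
The closure conditions can be tested mechanically on a small finite sample space. The sketch below is an illustration only; `power_set` and `is_algebra` are hypothetical helper names. On a finite sample space the family is finite, so any countable union of its members reduces to a finite union and the same check covers the $\sigma$-algebra condition.

```python
from itertools import chain, combinations


def power_set(omega) -> set:
    """All subsets of omega, each stored as a frozenset."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}


def is_algebra(family: set, omega) -> bool:
    """Check closure under complement (relative to omega) and pairwise union."""
    omega = frozenset(omega)
    closed_complement = all(omega - a in family for a in family)
    closed_union = all(a | b in family for a in family for b in family)
    return closed_complement and closed_union


omega = {1, 2, 3}
full_family = power_set(omega)

# A family that is missing the complement of {1}, namely {2, 3}.
broken_family = {frozenset(), frozenset({1}), frozenset(omega)}

print(is_algebra(full_family, omega))
print(is_algebra(broken_family, omega))
```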


Building on this foundation, a probability measure is the measure which
maps from the event space into real number space on the [0, 1] interval. We
typically think of this as an odds function (i.e., what are the odds of a winning
lotto ticket? 1/15,890,700). To be mathematically precise, suppose we define a
set of events $A = \{\omega_1, \ldots, \omega_n\} \subset \Omega$; for example, we choose n different numbers.
The probability of winning the lotto is $P[A] = n/N$, where N is the number of
equally likely outcomes in $\Omega$. Our intuition would
indicate that $P[\Omega] = 1$, or the probability of winning given that you have
covered the board is equal to one (a certainty). Further, if you don’t bet, the
probability of winning is zero, or $P[\emptyset] = 0$.


Definition 2.11. Given a sample space $\Omega$ and an associated $\sigma$-algebra $\mathcal{F}$, a
probability function is a function P[A] with domain $\mathcal{F}$ that satisfies

• $P(A) \geq 0$ for all $A \in \mathcal{F}$.
• $P(\Omega) = 1$.
• If $A_1, A_2, \ldots \in \mathcal{F}$ are pairwise disjoint, then $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.


FIGURE 2.1
Mapping from Event Space to Probability Space.


Breaking this down a little at a time, P[A] is a probability measure that is
defined on an event space. The concept of a measure will be developed more
fully in Chapter 3, but for our current uses, the measure assigns a value to an
outcome in event space (see Figure 2.1). This value is greater than or equal
to zero for any outcome in the algebra. Further, the value of the measure for
the entire sample space is 1. This implies that some possible outcome will
occur. Finally, the measure is additive over individual events. This definition
is related to the required axioms of probability

$$P\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} P[A_i]. \tag{2.28}$$

Stated slightly differently, the basic axioms of probability are:


Definition 2.12. Axioms of Probability:
1. $P[A] \geq 0$ for any event A.
2. $P[S] = 1$ where S is the sample space.
3. If $\{A_i\}$, $i = 1, 2, \ldots$, are mutually exclusive (that is, $A_i \cap A_j = \emptyset$ for all $i \neq j$),
then $P[A_1 \cup A_2 \cup \cdots] = P[A_1] + P[A_2] + \cdots$.


Thus, any function obeying these properties is a probability function. 
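
For a finite sample space the three axioms can be checked exhaustively. The sketch below uses the fair six-sided die with the counting probability $P[A] = n(A)/N$; the fair-die assumption and the names `prob` and `events` are mine, for illustration only.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset(range(1, 7))  # sample space of a six-sided die


def prob(event) -> Fraction:
    """Counting probability: outcomes in the event over outcomes in omega."""
    return Fraction(len(frozenset(event) & omega), len(omega))


# Every subset of omega is treated as an event (the power set).
events = [
    frozenset(s)
    for k in range(len(omega) + 1)
    for s in combinations(sorted(omega), k)
]

axiom_1 = all(prob(a) >= 0 for a in events)
axiom_2 = prob(omega) == 1
# Additivity, checked for every pair of mutually exclusive events.
axiom_3 = all(
    prob(a | b) == prob(a) + prob(b)
    for a in events
    for b in events
    if not (a & b)
)

print(axiom_1, axiom_2, axiom_3)
```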




2.2 What Is Statistics? 


Given an understanding of random variables and probability, it is possible to 
offer a definition of statistics. Most definitions of statistics revolve around the 
synthesis of data into a smaller collection of numbers that contain the mean- 
ingful information in the sample. Here we consider three standard definitions 
of statistics. 


Definition 2.13. Statistics is the science of assigning a probability of an 
event on the basis of experiments. 


Definition 2.14. Statistics is the science of observing data and making in- 
ferences about the characteristics of a random mechanism that has generated 
the data. 


Definition 2.15. Statistics is the science of estimating the probability
distribution of a random variable on the basis of repeated observations drawn from
the same random variable.

These definitions highlight different facets of statistics. Each definition
contains a notion of probability. Two of the definitions make reference to
estimation. As developed in Kmenta’s [26] definition in Chapter 1, estimation
can imply description or inference. In addition, one of the definitions explicitly
states that statistics deals with experiments.


2.3. Chapter Summary 


Probability is a primary building block of statistical inference. 


Most of the early development of probability theory involved games of 
chance (aleatoric, from alea, a dice game). 


In general, probability theory can be justified using three approaches: 


— Frequency — based on the relative number of times an event occurs. 


— Personal Probability — based on personal belief. This approach is 
similar to the construction of utility theory. 


— Axiomatic Probability — based on the mathematics of measure 
theory. 


Most econometric applications involve the estimation of unknown parame- 
ters. As developed by Zellner, a direct probability is a probability model 
where we know the probability structure completely. For example, we know 
the probability structure for the unweighted die. An alternative approach 
is the inverse probability formulation, where we observe the outcomes 
of the probability model and wish to infer something about the true na- 
ture of the probability model. Most econometric applications involve an 
inverse probability formulation. 


While there are three different approaches to probability theory, most 
econometric applications are interested in two broad categories of proba- 
bility: 


— Bayesian — the probability structure based on the degree of belief. 


— Classical — where we are interested in the empirical or observed 
frequency of events. 


The concept of a probability is related to the notion of a random variable. 
This concept is best described by contrasting the notion of a deterministic 
outcome with a random outcome. 


— We always know the outcome of a deterministic process (or function). 




— The outcome of a random variable may take on at least two different 
values. The exact outcome is unknowable before the event occurs. 


— Counting techniques provide a mechanism for developing classical 
probabilities. These models are related to the frequency approach. 


— The Sample Space is the set of all possible outcomes. 


— An Event can either be simple (i.e., containing a single outcome) or 
composite (i.e., including several simple events). The event of an
even-numbered die roll is composite. It contains the outcomes s = {2, 4, 6}.


• The axiomatic development of probability theory allows us to generalize
models of random events which allow for tests of consistency for random
variables.


2.4 Review Questions 


2-1R. What are the three different approaches to developing probability? 


2-2R. Consider two definitions of the same event — whether it rains tomor- 
row. We could talk about the probability that it will rain tomorrow or 
the rainfall observed tomorrow. Which event is more amenable to the 
frequency approach to probability? 


2-3R. Consider two physical mechanisms — rolling an even number given a six- 
sided die and a flipped coin landing heads up. Are these events similar 
(i.e., do they have the same probability function)?


2.5 Numerical Exercises 


2-1E. What is the probability that you will roll an even number given the
standard six-sided die?

a. Roll the die 20 times. Did you observe the anticipated number of
even-numbered outcomes?

b. Continue rolling until you have rolled the die 40 times. Is the number
of outcomes closer to the theoretical number of even-numbered rolls?


2-2E. Continuing from Exercise 2-1E, what is the probability of rolling a 2 or
a 4 given a standard six-sided die?

a. Roll the die 20 times. Did you observe the anticipated number of 2s
and 4s?

b. Continue rolling until you have rolled the die 40 times. Is the number
of outcomes closer to the theoretical number of rolls?

c. Are the rolls of 2 or 4 closer to the theoretical results than in Exercise
2-1E?


2-3E. Construct an empirical model for rainfall in the October-December
time period using a frequency approach using intervals of 1.0 inch of
rainfall.


2-4E. How many ways are there to draw 2 events from 5 possibilities? Hint:
S = {A, B, C, D, E}.


2-5E. What is the probability of s = {C, D} when the order is not important?


2-6E. Consider a random variable constructed by rolling a six-sided die and
flipping a coin. Taking x = {1, 2, 3, 4, 5, 6} to be the outcome of the die
roll and y = {1 if heads, -1 if tails} to be the outcome of the coin toss,
construct the random variable z = x × y.

a. What is the probability that the value of z will be between -2 and
2?

b. What is the probability that the value of z will be greater than 2?

3

Much of the development of probability in Chapter 2 involved probabilities in
the abstract. We briefly considered the distribution of rainfall, but we were
largely interested in flipping coins or rolling dice. These examples are typically
referred to as aleatoric (involving games of chance). In this chapter we develop
somewhat more complex versions of probability which form the basis for most
econometric applications. We will start by developing the uniform distribution
that defines a frequently used random variable. Given this basic concept, we
then develop several probability relationships. We then discuss more general
specifications of random variables and their distributions.


3.1 Uniform Probability Measure


I think that Bierens’s [5] discussion of the uniform probability measure
provides a firm basis for the concept of a probability measure. First, we follow
the conceptual discussion of placing ten balls numbered 0 through 9 into a
container. Next, we draw an infinite sequence of balls out of the container,
replacing the ball each time. In Excel™, we can mimic this sequence using
the function floor(rand()*10,1). This process will give a sequence of random
numbers such as presented in Table 3.1. Taking each column, we can
generate three random numbers: {0.741483, 0.029645, 0.302204}. Note that each
of these sequences is contained in the unit interval $\Omega = [0, 1]$. The primary
point of the demonstration is that the number drawn $\{x \in \Omega = [0, 1]\}$ is a
probability measure. Taking x = 0.741483 as the example, we want to prove
that $P([0, x = 0.741483]) = 0.741483$. To do this we want to work out the
probability of drawing a number less than 0.741483. As a starting point, what
is the probability of drawing the first number in Table 3.1 less than 7? It
is 7/10, corresponding to the digits {0, 1, 2, 3, 4, 5, 6}. Thus, without considering
the second number, the
probability of drawing a number less than 0.741483 is somewhat greater than
7/10. Next, we consider drawing a second number given that the first number
drawn is greater than or equal to 7. As a starting point, consider the scenario
where the number drawn is equal to seven. This occurs 1/10 of the time.
Note that the two scenarios are disjoint. If the first number drawn is less than
seven, it is not equal to seven. Thus, we can rely on the summation rule of
probabilities:

$$\text{If } A_i \cap A_j = \emptyset \text{ for } i \neq j, \text{ then } P\left[\bigcup_{k=1}^{\infty} A_k\right] = \sum_{k=1}^{\infty} P[A_k]. \tag{3.1}$$


TABLE 3.1
Random Draws of Single Digits

Ball Drawn    Draw 1    Draw 2    Draw 3
    1           7         0         3
    2           4         2         0
    3           1         9         2
    4           4         6         2
    5           8         4         0
    6           3         5         4


The probability of drawing a number less than 0.74 is the sum of drawing
the first number less than 7 and the second number less than 4 given that
the first number drawn is 7. The probability of drawing the second number
less than 4 is 4/10, corresponding to the digits {0, 1, 2, 3}. Given that the first
number equal to 7 only occurs 1/10 of the time, the probability of the two events is

$$P([0, x = 0.74]) = \frac{7}{10} + \frac{4}{10}\left(\frac{1}{10}\right) = \frac{7}{10} + \frac{4}{100} = 0.74. \tag{3.2}$$

Continuing to iterate this process backward, we find that $P([0, x = 0.741483]) =
0.741483$. Thus, for $x \in [0, 1]$ we have $P([0, x]) = x$.
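
The digit-by-digit argument can be written as a short recursion. The sketch below uses exact fractions and assumes each digit 0 through 9 is drawn with probability 1/10, independently of the others; the function name `prob_below` is hypothetical.

```python
from fractions import Fraction


def prob_below(digits: list[int]) -> Fraction:
    """Probability that the drawn number is below 0.d1d2...dn."""
    total = Fraction(0)
    # Probability that all earlier digits matched the target exactly.
    matched_so_far = Fraction(1)
    for d in digits:
        # The current digit is strictly smaller: d favourable digits out of 10.
        total += matched_so_far * Fraction(d, 10)
        # Otherwise the digit must match exactly before moving to the next one.
        matched_so_far *= Fraction(1, 10)
    return total


print(prob_below([7]))
print(prob_below([7, 4]))
print(prob_below([7, 4, 1, 4, 8, 3]))
```

Each additional digit adds a disjoint event, so the terms are summed as in Equation 3.1.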

Before we complete Bierens’s discussion, let us return to Definition 2.12.
We know that a function meets our axioms for a probability function if (1) the
value of the function is non-negative for any event, (2) the probability of the
sample space is equal to one, and (3) if a set of events are mutually exclusive
their probabilities are additive. For our purposes in econometrics, it is typically
sufficient to conceptualize probability as a smooth function $F(x_1, x_2)$ where
$x_1$ and $x_2$ are two points defined on the real number line. Given that we define
an event $\omega$ such that $\omega \Rightarrow x \in [x_1, x_2]$, the probability of $\omega$ is defined as

$$F[x_1, x_2] = \int_{x_1}^{x_2} f(x)\,dx \tag{3.3}$$

where $f(x) \geq 0$ for all $x \in X$ implied by $\Omega$ (where $\omega \in \Omega$) and

$$\int_X f(x)\,dx = 1. \tag{3.4}$$

Hence, given that $f(x) \geq 0 \ \forall x \in X$ we meet the first axiom of Definition 2.12.
Second, the definition of f(x) in Equation 3.4 implies the second axiom. And,
third, given that we can form mutually exclusive events by partitioning the
real number line, this specification meets the third axiom. Thus, the uniform
distribution defined as

$$U[0, 1] \Rightarrow f(x) = 1 \text{ for } x \in [0, 1] \text{ and } 0 \text{ otherwise} \tag{3.5}$$

meets the basic axioms for a valid probability function.

While the definition of a probability function in Equation 3.3 is sufficient for
most econometric applications, more rigorous proofs and formulations are
frequently developed in the literature. In order to understand these formulations,
consider the Riemann sum, which is used in most undergraduate calculus texts
to justify the integral


$$\int_a^b f(x)\,dx = \lim_{K \to \infty} \sum_{k=1}^{K} |x_k - x_{k-1}|\, f(x_k), \qquad x_k = a + \frac{(b-a)\,k}{K}. \tag{3.6}$$


Taking some liberty with the mathematical proof, the concept is that as the
number of intervals K becomes large, the interval of approximation $x_k - x_{k-1}$
becomes small and the Riemann sum approaches the antiderivative of the
function. In order to motivate our development of the Lebesgue integral,
consider a slightly more rigorous specification of the Riemann sum. Specifically,
we define two formulations of the Riemann sum:

$$\begin{aligned}
S_1(K) &= \sum_{k} |x_k - x_{k-1}| \sup_{[x_{k-1},\,x_k]} f(x) \\
S_2(K) &= \sum_{k} |x_k - x_{k-1}| \inf_{[x_{k-1},\,x_k]} f(x)
\end{aligned} \tag{3.7}$$

where $S_1(K)$ is the upper bound of the integral and $S_2(K)$ is the lower
bound of the integral. Notice that sup (supremum) and inf (infimum) are
similar to max (maximum) and min (minimum). The difference is that the
maximum and minimum values are inside the range of the function while
the supremum and infimum may be limits (i.e., least upper bounds and
greatest lower bounds). Given the specification in Equation 3.7 we know that
$S_1(K) \geq S_2(K)$. Further, we can define a residual

$$\epsilon(K) = S_1(K) - S_2(K). \tag{3.8}$$

The real value of the proof is that we can make $\epsilon(K)$ arbitrarily small by
increasing K.
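
The behaviour of the residual can be seen numerically. The sketch below is a simplified illustration that assumes an increasing function on equal-width intervals, so that the supremum on each interval sits at its right end and the infimum at its left end; the function names `darboux_sums` and `square` are hypothetical.

```python
def darboux_sums(f, a: float, b: float, K: int) -> tuple[float, float]:
    """Upper and lower sums on K equal intervals for an increasing function f."""
    width = (b - a) / K
    upper = 0.0
    lower = 0.0
    for k in range(1, K + 1):
        left = a + (k - 1) * width
        right = a + k * width
        # For an increasing f: sup is f(right) and inf is f(left).
        upper += width * f(right)
        lower += width * f(left)
    return upper, lower


def square(x: float) -> float:
    return x * x


for K in (10, 100, 1000):
    s1, s2 = darboux_sums(square, 0.0, 1.0, K)
    print(K, s1, s2, s1 - s2)
```

For this example the two sums bracket the integral of $x^2$ on [0, 1], which is 1/3, and the residual shrinks in proportion to 1/K.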

As complex as the development of the Riemann sum appears, it is
simplified by the fact that we only consider simple intervals on the real number line.
The axioms for a probability (measure) defined in Definitions 2.9 and 2.10
refer to a sigma-algebra. As a starting point, we need to develop the concept
of a measure a little more rigorously. Over the course of a student’s education,
most become so comfortable with the notion of measuring physical
phenomena that they take it for granted. However, consider the problem of learning
to count, basically the development of a number system. Most of us were
introduced to the number system by counting balls or marbles. For example,
in Figure 3.1 we conceptualize a measure ($\mu(A)$, defined as the number of
objects in set S).[1] In this scenario, the defined set is the set of all objects A
so the measure can be defined as

$$\mu(A) \to \mathbb{R}_+^1 \Rightarrow \mu(A) = 18. \tag{3.9}$$

Notice that if we change the definition of the set slightly, the mapping changes.
Instead of A, suppose that I redefined the set as the set of circles, B. The
measure then becomes $\mu(B) = 15$.

[1] To be terribly precise, the count function could be defined as
$\mu(A) = \sum_{i \in A} 1$, where $n(A) = \mu(A)$ from Chapter 2.


FIGURE 3.1
Defining a Simple Measure.


Implicitly, this counting measure has several embedded assumptions. For
example, the count is always positive (actually the count is always a natural
number). In addition, the measure is additive. I can divide the sets in S into
circles (set B) and triangles (C, which has a count of $\mu(C) = 3$). Note that
the set of circles and triangles is mutually exclusive: no element in the set is
both a circle and a triangle. A critical aspect of a measure is that it is additive.

$$\mu(B \cup C) = \mu(B) + \mu(C) \Rightarrow 15 + 3 = 18. \tag{3.10}$$
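
The counting measure and its additivity can be mimicked with Python sets. The sketch below rebuilds the counts quoted in the text (15 circles and 3 triangles); the representation of the objects by index and the function name `mu` are illustrative choices.

```python
# Eighteen objects: fifteen circles and three triangles, identified by index.
shapes = ["circle"] * 15 + ["triangle"] * 3


def mu(subset: set) -> int:
    """Counting measure: the number of objects in the subset."""
    return len(subset)


A = set(range(len(shapes)))                       # all objects
B = {i for i in A if shapes[i] == "circle"}       # circles
C = {i for i in A if shapes[i] == "triangle"}     # triangles

print(mu(A), mu(B), mu(C))
# B and C are mutually exclusive, so the measure of the union is the sum.
print(B & C == set(), mu(B | C) == mu(B) + mu(C))
```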


Stochastic Process — Doob 


The theory of probability is concerned with the measure proper- 
ties of various spaces, and with the mutual relations of measurable 
functions defined on those spaces. Because of the applications, it is 
frequently (although not always) appropriate to call these spaces 
sample spaces and their measurable sets events, and these terms 
should be borne in mind in applying the results. 

The following is the precise mathematical setting.... It is
supposed that there is some basic space $\Omega$, and a certain basic
collection of sets of points of $\Omega$. These sets will be called measurable sets;
it is supposed that the class of measurable sets is a Borel field. It is
supposed that there is a function $P\{\cdot\}$, defined for all measurable
sets, which is a probability measure, that is, $P\{\cdot\}$ is a completely
additive non-negative set function, with value 1 on the whole space.
The number $P\{A\}$ will be called the probability or measure of A
[11, p. 2].


The concept of a measure is then a combination of the set on which the
measure is defined and the characteristics of the measure itself. In general, the
properties of a set that make a measure possible are defined by the
characteristics of a Borel set, which is a specific class of $\sigma$-algebras. To build up
the concept of a $\sigma$-algebra, let us start by defining an algebra. Most students
think of algebra as a freshman math course, but an (abstract) algebra can be
defined as


Stochastic Methods in Economics and Finance — Malliaris 


Let $\Omega$ be an arbitrary space consisting of points $\omega$. Certain classes
of subsets of $\Omega$ are important in the study of probability. We now
define the class of subsets called $\sigma$-field or $\sigma$-algebra, denoted
$\mathcal{F}$.... We call the elements of $\mathcal{F}$ measurable sets [30, p. 2].


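Definition 3.1. A collection of subsets $\mathcal{A}$ defined on set $X$ is said
to be an algebra in $X$ if $\mathcal{A}$ has the following properties.


i. $X \in \mathcal{A}$.
ii. $A \in \mathcal{A} \Rightarrow A^c \in \mathcal{A}$, where $A^c$ is the
complement of $A$ relative to $X$.
iii. If $A, B \in \mathcal{A}$ then $A \cup B \in \mathcal{A}$ [6, p. 7].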


Notice that the set defined in Figure 3.1 is an algebra. 


Definition 3.2. A collection $\mathcal{M}$ of subsets of $X$ is said to be a
$\sigma$-algebra in $X$ if, whenever $A_n \in \mathcal{M}$ for all
$n \in \mathbb{N}$, then $\bigcup_{n=1}^{\infty} A_n \in \mathcal{M}$ [6, p. 7].


In Definition 3.2, $A_n$ is a sequence of subsets and $\mathbb{N}$ denotes the
set of all natural numbers. Basically, a $\sigma$-algebra is an abstract algebra
that is closed under infinite union. In the case of the set depicted in
Figure 3.1, the set of all possible unions is finite. However, the algebra is
closed under all possible unions so it is a $\sigma$-algebra. The Borel set is
then the $\sigma$-algebra that contains the smallest number of sets (in addition,
it typically contains the whole set $X$ and the null set).
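
For a finite collection, the properties in Definition 3.1 can be checked
mechanically, and closure under finite unions then covers every possible union.
The following is a minimal sketch; the three-shape set and the function name
`is_algebra` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def is_algebra(collection, whole):
    """Check the three properties of Definition 3.1 on a finite collection."""
    sets = {frozenset(s) for s in collection}
    whole = frozenset(whole)
    if whole not in sets:                                  # property i
        return False
    if any(whole - s not in sets for s in sets):           # property ii
        return False
    return all(a | b in sets for a, b in combinations(sets, 2))  # property iii


X = {"circle", "triangle", "square"}   # hypothetical three-element set
candidate = [set(), {"circle"}, {"triangle", "square"}, X]
broken = [set(), {"circle"}, X]        # lacks the complement of {"circle"}

print(is_algebra(candidate, X), is_algebra(broken, X))
```

The second collection fails because it contains a set without its complement.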

Given this somewhat lengthy introduction, we can define the $\sigma$-algebra
and Borel set for Bierens’s [5] discussion as


Definition 3.3. The $\sigma$-algebra generated by the collection

$$\mathcal{C} = \{(a,b) : \forall a < b,\ a, b \in \mathbb{R}\} \tag{3.11}$$

of all open intervals in $\mathbb{R}$ is called the Euclidean Borel field,
denoted $\mathcal{B}$, and its members are called Borel sets.


In this case, we have defined $a = 0$ and $b = 1$.
Also, note that for any $x \in [0,1]$, $P(\{x\}) = P([x,x]) = 0$. This has the
advantage of eliminating the lower end of the range. Specifically,

$$P([0,x]) = P(\{0\}) + P((0,x]). \tag{3.12}$$

Further, for $a < b$, $a, b \in [0,1]$,

$$P([a,b]) = P((a,b]) = P([a,b)) = P((a,b)) = b - a. \tag{3.13}$$

In the Bierens formulation

$$\mathcal{F}_0 = \{(a,b), [a,b], (a,b], [a,b),\ \forall a, b \in [0,1],\ a \le b,
\text{ and their countable union}\}. \tag{3.14}$$


This probability measure is a special case of the Lebesgue measure. 

Building on the uniform distribution, we next define the Lebesgue measure
as a function $\lambda$ that measures the length of the interval $(a,b)$ on any
Borel set $B$ in $\mathbb{R}$.

$$\lambda(B) = \inf_{B \subset \bigcup_{j=1}^{\infty} (a_j, b_j)}
\sum_{j=1}^{\infty} (b_j - a_j). \tag{3.15}$$


It is the total length of the Borel set taken from the outside. Based on the
Lebesgue measure, we can then define the Lebesgue integral based on the
basic definition of the Riemann integral.

$$\int_a^b f(x)\,dx = \sup \sum_{m=1}^{n} \left( \inf_{x \in I_m} f(x) \right)
\lambda(I_m), \tag{3.16}$$

where the supremum is taken over finite partitions of the interval into
subintervals $I_m$. Note that the result in Equation 3.16 is similar in concept
to the simple forms of the Riemann sum presented in Equation 3.6. Replacing the
intervals of the summation with Borel sets $B_m$ that partition a set $A$, the
Lebesgue integral becomes

$$\int_A f(x)\,dx = \sup \sum_{m=1}^{n} \left( \inf_{x \in B_m} f(x) \right)
\lambda(B_m). \tag{3.17}$$

Hence, the probability measure in Equation 3.17 is a more general version of
the integral.

Specifically, the real number line which forms the basis of the Riemann 
sum is but one of the possible Borel sets that we can use to construct a 
probability measure. In order to flesh out this concept, consider measuring
the pH (acidity) of a pool using Phenol red. Phenol red changes color with 
the level of pH, ranging from yellow when the water is relatively acidic to a 
red/purple when the water is alkaline. This concept of acidity is mapped onto 
a pH scale running from 6.8 when the result is yellow to 8.2 when the result 
is red/purple. This mapping is constructed using a kit which tells me the pH 
level associated with each color. In this case we have a measure outside the 
typical concept of counting — it does not necessarily fit on the real number line. 
The mechanics given above ensure that we can develop a probability measure
associated with the pH level (i.e., without first mapping through the numeric
pH measure).


3.2 Random Variables and Distributions


Now we have established the existence of a probability measure based on a
specific form of $\sigma$-algebras called Borel fields. The question is, can we
extend this rather specialized formulation to broader groups of random
variables? Of course, or this would be a short textbook. As a first step, let’s
take the simple coin-toss example. In the case of a coin there are two possible
outcomes (heads or tails). These outcomes completely specify the sample space.
To add a little structure, we construct a random variable $X$ that can take on
two values, $X = 0$ or $1$ (as depicted in Table 2.6). If $X = 1$ the coin toss
resulted in a head, while if $X = 0$ the coin toss resulted in a tail. Next, we
define each outcome based on an event space $\omega$:


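$$\begin{aligned}
P(X = 1) &= P(\{\omega \in \Omega : X(\omega) = 1\}) = P(\{H\}) \\
P(X = 0) &= P(\{\omega \in \Omega : X(\omega) = 0\}) = P(\{T\}).
\end{aligned} \tag{3.18}$$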


In this case the physical outcome of the experiment is either a head
($\omega =$ heads) or a tail ($\omega =$ tails). These events are “mapped into”
number space: the measure of the event is either a zero or a one.

The probability function is then defined by the random event $\omega$. Defining
$\omega$ as a uniform random variable from our original example, one alternative
would be to define the function as

$$X(\omega) = 1 \text{ if } \omega < 0.50. \tag{3.19}$$


This definition results in the standard 50-50 result for a coin toss. However, 
it admits more general formulations. For example, if we let 


$$X(\omega) = 1 \text{ if } \omega < 0.40 \tag{3.20}$$


the probability of heads becomes 40 percent. 
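
The mapping in Equations 3.19 and 3.20 can be sketched as a short simulation.
This is an illustration only: the function name `coin`, the seed, and the
number of draws are assumptions, and Python's `random.random()` stands in for
the uniform draw $\omega$.

```python
import random


def coin(omega, threshold=0.50):
    """Map a uniform draw omega in [0, 1) to heads (1) or tails (0)."""
    return 1 if omega < threshold else 0


rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_draws = 100_000
heads = sum(coin(rng.random(), threshold=0.40) for _ in range(n_draws))
share_heads = heads / n_draws
print(round(share_heads, 3))
```

With a threshold of 0.40 the simulated share of heads should settle close to
40 percent as the number of draws grows.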
Given this intuition, the next step is to formally define a random variable. 
Three alternative definitions should be considered 


Definition 3.4. A random variable is a function from a sample space $S$ into
the real numbers.


Definition 3.5. A random variable is a variable that takes values according 
to a certain probability. 


Definition 3.6. A random variable is a real-valued function defined over a 
sample space. 


In this way a random variable is an abstraction. We assumed that there was 
a random variable defined on some sample space like flipping a coin. The 
flipping of the coin is an outcome in an abstract space (i.e., a Borel set). 


$$S = \{s_1, s_2, \cdots, s_n\}. \tag{3.21}$$

We then assign a numeric value to each element of this set.

$$\begin{array}{ll}
X : S \to \mathbb{R}^1 & X(\omega) : \Omega \to \mathbb{R}^1 \\
s_i \mapsto x_i & \omega_i \mapsto X(\omega_i) = x_i
\end{array} \tag{3.22}$$

There are two ways of looking at this transformation. First, the Borel set is
simply defined as the real number line (remember that the real number line
is a valid Borel set). Alternatively, we can view the transformation as a
two-step mapping. For example, a measure can be used to define the quantity of
wheat produced per acre. Thus, we are left with two measures of the same
phenomenon: the quantity of wheat produced per acre and the probability
of producing that quantity of wheat. The probability function (or measure) is
then defined based on that random variable. In either case it is defined as

$$P(X(\omega) = x_i) = P(\{\omega \in \Omega : X(\omega) = x_i\}). \tag{3.23}$$

Using either justification, for the rest of this text we are simply going to
define a random variable as either a discrete ($x_i = 1, 2, \cdots, N$) or real
number ($x \in (-\infty, \infty)$).


3.2.1 Discrete Random Variables 


Several of the examples used thus far in the text have been discrete random 
variables. For example, the coin toss is a simple discrete random variable where 
the outcome can take on a finite number of values — X = {Tails, Heads} or in 
numeric form X = {0,1}. Using this intuition, we can then define a discrete 
random variable as 


Definition 3.7. A discrete random variable is a variable that takes a count- 
able number of real numbers with certain probability. 


In addition to defining random variables as either discrete or continuous we
can also define random variables as either univariate or multivariate.
Consider the dice rolls presented in Table 3.2. Anna rolled two six-sided dice
(one blue and one red) while Alex rolled one eight-sided die and one six-sided
die. Conceptually, the dice rolled by each individual are a bivariate discrete
set of random variables as defined in Definition 3.8.


Definition 3.8. A bivariate discrete random variable is a variable that takes 
a countable number of bivariate points on the plane with certain probability. 


TABLE 3.2
Anna and Alex’s Dice Rolls
              Anna                Alex
                            Eight         Six
Roll     Blue     Red       Sided         Sided
  1        6       4          5             6
  2        6       3          8             9
  3        5       4          1             1
  4        5       3       [illegible]      6
  5        3       5          5             1
  6        5       1          4             5
  7        3       6          6             1
  8        4       4          5             1
  9        5       2          5             5
 10        2       1          2             5
 11        4       2          6             4
 12        2       5          3             1
 13        5       3          1             4
 14        3       2       [illegible]      3
 15        1       4          6             6
 16        2       3          6             6
 17        3       3          3             2
 18        5       4          2             3
 19        3       3          1             3
 20        6       6          7             2


For example, the pair {2,1} is the tenth outcome of Anna’s rolls. In most
board games the sum of the outcomes of the two dice is the important number
(the number of spaces moved in Monopoly™). However, in other games
the outcome may be more complex. For example, the outcome may be whether
a player suffers damage defined by whether the eight-sided die is greater than
three while the amount of damage suffered is determined by the six-sided
die. Thus, we may be interested in defining a secondary random variable (the
number of spaces moved as the sum of the result of the blue and red die or
the amount of damage suffered by a character of a board game based on a
more complex protocol) based on the outcomes of the bivariate random
variables. However, at the most basic level we are interested in a bivariate
random variable.
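
The step from a bivariate random variable to a secondary random variable can be
sketched by enumerating the joint outcomes of two dice. The sketch assumes fair
six-sided dice, which the text does not state, and the variable names are
illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Joint distribution of two fair six-sided dice: 36 equally likely pairs.
joint = {pair: Fraction(1, 36) for pair in product(range(1, 7), repeat=2)}

# Secondary random variable: the number of spaces moved (sum of the dice).
spaces_moved = Counter()
for (blue, red), p in joint.items():
    spaces_moved[blue + red] += p

print(spaces_moved[7], sum(spaces_moved.values()))
```

Each pair such as (2, 1) is one point of the bivariate variable, while several
pairs can map to the same value of the sum.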


3.2.2 Continuous Random Variables 


While discrete random variables are important in some econometric applica- 
tions, most econometric applications are based on continuous random variables 
such as the price of consumption goods or the quantity demanded and supplied 
in the market place. As discussed in Chapter 2, defining a continuous random 
variable as some subset on the real number line complicates the definition of 
probability. Because the number of real numbers for any subset of the real 
number line is infinite, the standard counting definition of probability used 
by the frequency approach presented in Equation 2.2 implies a zero
probability. Hence, it is necessary to develop probability using the concept of
a probability density function (or simply the density function) as
presented in Definition 3.9.

Definition 3.9. If there is a non-negative function $f(x)$ defined over the
whole line such that

$$P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f(x)\,dx \tag{3.24}$$

for any $x_1$ and $x_2$ satisfying $x_1 \le x_2$, then $X$ is a continuous random
variable and $f(x)$ is called its density function.


By the second axiom of probability (see Definition 2.12)

$$\int_{-\infty}^{\infty} f(x)\,dx = 1. \tag{3.25}$$

The simplest example of a continuous random variable is the uniform
distribution

$$f(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise.}
\end{cases} \tag{3.26}$$


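Using the definition of the uniform distribution function in Equation 3.26,
we can demonstrate that the probability of the continuous random variable
defined in Equation 3.24 follows the required axioms for probability. First,
$f(x) \ge 0$ for all $x$. Second, the total probability equals one. To see this,
consider the integral

$$\begin{aligned}
\int_{-\infty}^{\infty} f(x)\,dx &= \int_{-\infty}^{0} f(x)\,dx
+ \int_{0}^{1} f(x)\,dx + \int_{1}^{\infty} f(x)\,dx \\
&= \int_{0}^{1} f(x)\,dx = \int_{0}^{1} dx = \left. x \right|_0^1 + C
= (1 - 0) + C.
\end{aligned} \tag{3.27}$$

Thus the total value of the integral is equal to one because $C = 0$: the
constant of integration cancels in a definite integral.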

The definition of a continuous random variable, like the case of the discrete
random variable, can be extended to include the possibility of a bivariate
continuous random variable. Specifically, we can extend the univariate uniform
distribution in Equation 3.26 to represent the density function for the
bivariate outcome $\{x, y\}$

$$f(x,y) = \begin{cases} 1 & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \\
0 & \text{otherwise.} \end{cases} \tag{3.28}$$

The fact that the density function presented in Equation 3.28 conforms to the
axioms of probability is left as an exercise.


Definition 3.10. If there is a non-negative function $f(x,y)$ defined over the
whole plane such that

$$P(x_1 \le X \le x_2,\ y_1 \le Y \le y_2)
= \int_{y_1}^{y_2} \int_{x_1}^{x_2} f(x,y)\,dx\,dy \tag{3.29}$$

for $x_1$, $x_2$, $y_1$, and $y_2$ satisfying $x_1 \le x_2$, $y_1 \le y_2$, then
$(X,Y)$ is a bivariate continuous random variable and $f(x,y)$ is called the
joint density function.

Much of the work with distribution functions involves integration. In order
to demonstrate a couple of solution techniques, I will work through some
examples.


Example 3.11. If $f(x,y) = xy\exp(-x-y)$, $x > 0$, $y > 0$ and 0 otherwise,
what is $P(X > 1, Y < 1)$?

$$P(X > 1, Y < 1) = \int_0^1 \int_1^{\infty} xy e^{-(x+y)}\,dx\,dy. \tag{3.30}$$

First, note that the integral can be separated into two terms:

$$P(X > 1, Y < 1) = \int_1^{\infty} x e^{-x}\,dx \int_0^1 y e^{-y}\,dy. \tag{3.31}$$

Each of these integrals can be solved using integration by parts:

$$\begin{aligned}
d(uv) &= v\,du + u\,dv \\
v\,du &= d(uv) - u\,dv \\
\int v\,du &= uv - \int u\,dv.
\end{aligned} \tag{3.32}$$

In terms of a proper integral we have

$$\int_a^b v\,du = \left. uv \right|_a^b - \int_a^b u\,dv. \tag{3.33}$$

In this case, we have

$$\int_1^{\infty} x e^{-x}\,dx = \left. -x e^{-x} \right|_1^{\infty}
+ \int_1^{\infty} e^{-x}\,dx = 2e^{-1} = 0.736. \tag{3.34}$$

Working on the second part of the integral,

$$\int_0^1 y e^{-y}\,dy = \left. -y e^{-y} \right|_0^1 + \int_0^1 e^{-y}\,dy
= 1 - 2e^{-1} = 0.264. \tag{3.35}$$

Putting the two parts together,

$$P(X > 1, Y < 1) = \int_1^{\infty} x e^{-x}\,dx \int_0^1 y e^{-y}\,dy
= (0.736)(0.264) = 0.194. \tag{3.36}$$


Definition 3.12. A T-variate random variable is a variable that takes a
countable number of points on the T-dimensional Euclidean space with certain
probabilities.


Following our development of integration by parts, we have attempted to 
keep the calculus at an intermediate level throughout this textbook. However, 
the development of certain symbolic computer programs may be useful to stu- 
dents. Appendix A presents a brief discussion of two such symbolic programs — 
Maxima (an open source program) and Mathematica (a proprietary program). 
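
As a complement to the symbolic programs, the integrals in Example 3.11 can also
be checked numerically. The sketch below uses a plain midpoint rule in Python;
the helper name `midpoint_integral`, the number of subintervals, and the
truncation of the infinite upper limit at 50 are all assumptions of this
illustration, not part of the text.

```python
import math


def midpoint_integral(f, a, b, n=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def kernel(t):
    """The integrand t * exp(-t) shared by both parts of Example 3.11."""
    return t * math.exp(-t)


# The infinite upper limit is truncated at 50 (an assumption of this sketch);
# the neglected tail is on the order of 1e-20.
p_x = midpoint_integral(kernel, 1.0, 50.0)   # integral of x e^{-x} over (1, inf)
p_y = midpoint_integral(kernel, 0.0, 1.0)    # integral of y e^{-y} over (0, 1)
p_joint = p_x * p_y
print(round(p_x, 3), round(p_y, 3), round(p_joint, 3))
```

The numerical values can be compared with the closed forms $2e^{-1}$ and
$1 - 2e^{-1}$ obtained by integration by parts.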


3.3. Conditional Probability and Independence 


In order to define the concept of a conditional probability, it is necessary
to discuss joint probabilities and marginal probabilities. A joint probability
is the probability of two random events. For example, consider drawing two
cards from the deck of cards. There are 52 x 51 = 2,652 different ordered pairs
for the first two cards from the deck. The marginal probability is the overall
probability of a single event or the probability of drawing a given card. The
conditional probability of an event is the probability of that event given that
some other event has occurred. Taking the roll of a single die, for example,
what is the probability of the die being a one if you know that the face number
is odd? The answer is 1/3. However, note that if you know that the roll of the
die is a one, the probability of the roll being odd is 1.

As a starting point, consider the requirements (axioms) for a conditional 
probability to be valid. 


Definition 3.13. Axioms of Conditional Probability:
1. $P(A|B) \ge 0$ for any event $A$.
2. $P(A|B) = 1$ for any event $A \supset B$.
3. If $\{A_i \cap B\}$, $i = 1, 2, \ldots$ are mutually exclusive, then
$$P(A_1 \cup A_2 \cup \cdots | B) = P(A_1|B) + P(A_2|B) + \cdots \tag{3.37}$$
4. If $B \supset H$, $B \supset G$, and $P(G) \ne 0$ then
$$\frac{P(H|B)}{P(G|B)} = \frac{P(H)}{P(G)}. \tag{3.38}$$


Note that Axioms 1 through 3 follow the general probability axioms with the
addition of a conditional term. The new axiom (Axiom 4) states that two
events conditioned on the same probability set have the same relationship as
the overall (or, as we will develop shortly, marginal) probabilities.
Intuitively, the conditioning set brings in no additional information about the
relative likelihood of the two events.
Theorem 3.14 provides a formal definition of conditional probability.


Theorem 3.14. $P(A|B) = P(A \cap B)/P(B)$ for any pair of events $A$ and
$B$ such that $P(B) > 0$.


Taking this piece by piece, $P(A \cap B)$ is the probability that both $A$ and
$B$ will occur (i.e., the joint probability of $A$ and $B$). Next, $P(B)$ is the
probability that $B$ will occur. Hence, the conditional probability $P(A|B)$ is
defined as the joint probability of $A$ and $B$ relative to the probability
that $B$ has occurred. Some texts refer to Theorem 3.14 as Bayes’ theorem;
however, in this text we will define Bayes’ theorem as depicted in Theorem 3.15.
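
Theorem 3.14 can be checked against the single-die example given earlier in this
section. The sketch assumes a fair six-sided die and uses exact fractions; the
function names are illustrative.

```python
from fractions import Fraction

FACES = range(1, 7)   # a fair six-sided die (fairness is assumed)


def prob(event):
    """Probability of an event, given as a predicate on the face value."""
    return Fraction(sum(1 for face in FACES if event(face)), len(FACES))


def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B), as in Theorem 3.14."""
    return prob(lambda face: a(face) and b(face)) / prob(b)


def is_one(face):
    return face == 1


def is_odd(face):
    return face % 2 == 1


print(cond_prob(is_one, is_odd), cond_prob(is_odd, is_one))
```

The two calls reverse the direction of conditioning, which is why they return
different values.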


Theorem 3.15 (Bayes’ Theorem). Let events $A_1, A_2, \ldots, A_n$ be mutually
exclusive events such that $P(A_1 \cup A_2 \cup \cdots \cup A_n) = 1$ and
$P(A_i) > 0$ for each $i$. Let $E$ be an arbitrary event such that $P(E) > 0$.
Then

$$P(A_i|E) = \frac{P(E|A_i)P(A_i)}{\sum_{j=1}^{n} P(E|A_j)P(A_j)}. \tag{3.39}$$


While Equation 3.39 appears different from the specification in Theorem 3.14,
we can demonstrate that they are the same concept. First, let us use the
relationship in Theorem 3.14 to define the probability of the joint event
$E \cap A_i$.

$$P(E \cap A_i) = P(E|A_i)P(A_i). \tag{3.40}$$

Next, if we assume that events $A_1, A_2, \cdots$ are mutually exclusive and
exhaustive, we can rewrite the probability of event $E$ as

$$P(E) = \sum_{j=1}^{n} P(E|A_j)P(A_j). \tag{3.41}$$

Combining the results of Equations 3.40 and 3.41 yields the friendlier version
of Bayes’ theorem found in Theorem 3.14:

$$P(A_i|E) = \frac{P(E \cap A_i)}{P(E)}. \tag{3.42}$$

Notice the direction of the conditional statement: if we know that event $E$
has occurred, what is the probability that event $A_i$ will occur?
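
Equation 3.39 can be sketched with the same die example, taking $A_j$ as the
event that the die shows face $j$ and $E$ as the event that the face is odd. The
fair die and the function name `bayes_posterior` are assumptions made for
illustration.

```python
from fractions import Fraction


def bayes_posterior(prior, likelihood, i):
    """P(A_i | E) from priors P(A_j) and likelihoods P(E | A_j)."""
    # Denominator of Equation 3.39, i.e. P(E) from Equation 3.41.
    evidence = sum(lik * p for lik, p in zip(likelihood, prior))
    return likelihood[i] * prior[i] / evidence


# A_j: the die shows face j (j = 1..6); E: the face is odd.
prior = [Fraction(1, 6)] * 6
likelihood = [Fraction(face % 2) for face in range(1, 7)]   # P(E | A_j)

posterior_one = bayes_posterior(prior, likelihood, 0)
print(posterior_one)
```

The result agrees with the direct application of Theorem 3.14 to the same
events.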

Given this understanding of conditional probability, it is possible to define
statistical independence. Independence of two events is defined as follows.


Definition 3.16. Events $A$ and $B$ are said to be independent if $P(A) =
P(A|B)$.

Hence, event $A$ is independent of event $B$ if knowing that $B$ has occurred
does not change the probability of $A$. Extending the scenario to the case of
three events:


Definition 3.17. Events A, B, and C are said to be mutually independent 
if the following equalities hold: 


a) $P(A \cap B) = P(A)P(B)$

b) $P(A \cap C) = P(A)P(C)$

c) $P(B \cap C) = P(B)P(C)$

d) $P(A \cap B \cap C) = P(A)P(B)P(C)$


3.3.1 Conditional Probability and Independence for Discrete 
Random Variables 


In order to develop the concepts of conditional probability and independence,
we start by analyzing the discrete bivariate case. As a starting point, we define
the marginal probability of a random variable as the probability that a given
value of one random variable will occur (i.e., $X = x_i$) regardless of the value
of the other random variable. For this discussion, we simplify our notation
slightly so that $P[X = x_i \cap Y = y_j] = P[X = x_i, Y = y_j] = P[x_i, y_j]$.
The marginal distribution for $x_i$ can then be defined as

$$P[x_i] = \sum_{j=1}^{m} P[x_i, y_j], \tag{3.43}$$

where $m$ is the number of values that $Y$ can take. Turning to the binomial
probability presented in Table 3.3, the marginal probability that $X = x_1$
(i.e., $X = 0$) can be computed as

$$\begin{aligned}
P[x_1] &= P[x_1, y_1] + P[x_1, y_2] + \cdots + P[x_1, y_6] \\
&= 0.01315 + 0.04342 + 0.05790 + 0.03893 + 0.01300 + 0.00158 = 0.16798.
\end{aligned} \tag{3.44}$$

By repetition the marginal value for each $X_i$ and $Y_j$ is presented in
Table 3.3. Applying a discrete form of Bayes’ theorem,

$$P[x_i|y_j] = \frac{P[x_i, y_j]}{P[y_j]}, \tag{3.45}$$

we can compute the conditional probability of $X = 0$ given $Y = 2$ as

$$P[x_1|y_3] = \frac{0.0581}{0.3456} = 0.16811. \tag{3.46}$$


TABLE 3.3
Binomial Probability
                                      Y                                Marginal
   X         0        1        2        3        4        5         Probability
   0      0.01315  0.04342  0.05790  0.03893  0.01300  0.00158       0.16798
   1      0.02818  0.09304  0.12408  0.08343  0.02786  0.00339       0.35998
   2      0.02417  0.07979  0.10640  0.07155  0.02389  0.00290       0.30870
   3      0.01039  0.03430  0.04574  0.03075  0.01027  0.00125       0.13270
   4      0.00222  0.00733  0.00978  0.00658  0.00220  0.00027       0.02838
   5      0.00018  0.00059  0.00079  0.00053  0.00018  0.00000       0.00227
Marginal
Probability 0.07829 0.25847 0.34469  0.23177  0.07740  0.00939


TABLE 3.4
Binomial Conditional Probabilities
   X     P[X, Y = 2]    P[Y = 2]    P[X|Y = 2]     P[X]
   0       0.05790      0.34469      0.16798     0.16798
   1       0.12408      0.34469      0.35998     0.35998
   2       0.10640      0.34469      0.30868     0.30870
   3       0.04574      0.34469      0.13270     0.13270
   4       0.00978      0.34469      0.02837     0.02838
   5       0.00079      0.34469      0.00229     0.00227


Table 3.4 presents the conditional probability for each value of $X$ given
$Y = 2$. Next, we offer a slightly different definition of independence for the
discrete bivariate random variable.


Definition 3.18. Discrete random variables are said to be independent if
the events $X = x_i$ and $Y = y_j$ are independent for all $i, j$. That is to
say, $P[x_i, y_j] = P[x_i] P[y_j]$.


To demonstrate the consistency of Definition 3.18 with Definition 3.16, note
that

$$P[x_i] = P[x_i|y_j] \Rightarrow P[x_i] = \frac{P[x_i, y_j]}{P[y_j]}. \tag{3.47}$$

Therefore, multiplying each side of the last equality in Equation 3.47 by
$P[y_j]$ yields $P[x_i] \times P[y_j] = P[x_i, y_j]$. Thus, we determine
independence by whether the $P[x_i, y_j]$ values equal $P[x_i] \times P[y_j]$.
Taking the first case, we check to see that

$$P[x_1] \times P[y_1] = 0.1681 \times 0.0778 = 0.0131 = P[x_1, y_1]. \tag{3.48}$$

Carrying out this process for each cell in Table 3.3 confirms the fact that
$X$ and $Y$ are independent. This result can be demonstrated in a second way
(more consistent with Definition 3.16). Note that the $P[X|Y = 2]$ column in
Table 3.4 equals the $P[X]$ column up to rounding: the conditional is equal to
the marginal in all cases.
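
The cell-by-cell check described above can be sketched as follows. The joint
probabilities are transcribed from Table 3.3; the tolerance is an assumption of
this sketch that allows for the five-decimal rounding in the printed table.

```python
# Joint probabilities P[x_i, y_j] as read from Table 3.3 (rows X = 0..5,
# columns Y = 0..5).
joint = [
    [0.01315, 0.04342, 0.05790, 0.03893, 0.01300, 0.00158],
    [0.02818, 0.09304, 0.12408, 0.08343, 0.02786, 0.00339],
    [0.02417, 0.07979, 0.10640, 0.07155, 0.02389, 0.00290],
    [0.01039, 0.03430, 0.04574, 0.03075, 0.01027, 0.00125],
    [0.00222, 0.00733, 0.00978, 0.00658, 0.00220, 0.00027],
    [0.00018, 0.00059, 0.00079, 0.00053, 0.00018, 0.00000],
]

p_x = [sum(row) for row in joint]            # marginal of X (row sums)
p_y = [sum(col) for col in zip(*joint)]      # marginal of Y (column sums)

# Largest gap between a cell and the product of its marginals.
max_gap = max(
    abs(joint[i][j] - p_x[i] * p_y[j])
    for i in range(6)
    for j in range(6)
)
TOLERANCE = 1e-4   # assumed allowance for five-decimal rounding
print(max_gap < TOLERANCE)
```

A joint table for dependent variables would produce at least one cell whose gap
exceeds the rounding allowance.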

Next, we consider the discrete form of the uncorrelated normal distribution 
as presented in Table 3.5. Again, computing the conditional distribution of X 
such that Y = 2 yields the results in Table 3.6. 


Theorem 3.19. Discrete random variables X and Y with the probability dis- 
tribution given in Table 3.1 are independent if and only if every row is pro- 
portional to any other row, or, equivalently, every column is proportional to 
any other column. 


TABLE 3.6
Uncorrelated Normal Conditional Probabilities
   X     P[X, Y = 2]    P[Y = 2]    P[X|Y = 2]     P[X]
   0       0.02520      0.35222      0.07155     0.07154
   1       0.08503      0.35222      0.24141     0.24142
   2       0.13191      0.35222      0.37451     0.37451
   3       0.08488      0.35222      0.24099     0.24098
   4       0.02259      0.35222      0.06414     0.06413
   5       0.00261      0.35222      0.00741     0.00741


Finally, we consider a discrete form of the correlated normal distribution
in Table 3.7. To examine whether the events are independent, we compute
the conditional probability for $X$ when $Y = 2$ and compare this conditional
distribution with the marginal distribution of $X$. The results presented in
Table 3.8 indicate that the random variables are not independent.


3.3.2 Conditional Probability and Independence for 
Continuous Random Variables 


The development of conditional probability and independence for continu- 
ous random variables follows the same general concepts as discrete random 
variables. However, constructing the conditional formulation for continuous 
variables requires some additional mechanics. Let us start by developing the 
conditional density function. 


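Definition 3.20. Let $X$ have density $f(x)$. The conditional density of $X$
given $a \le X \le b$, denoted by $f(x|a \le X \le b)$, is defined by

$$f(x|a \le X \le b) = \begin{cases}
\dfrac{f(x)}{\int_a^b f(x)\,dx} & \text{for } a \le x \le b \\
0 & \text{otherwise.}
\end{cases} \tag{3.49}$$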


Notice that Definition 3.20 defines the conditional probability for a single
continuous random variable conditioned on the fact that the random variable
is in a specific range ($a \le X \le b$). This definition can be expanded
slightly by considering any general range of the random variable $X$
($X \in S$).


Definition 3.21. Let $X$ have the density $f(x)$ and let $S$ be a subset of the
real line such that $P(X \in S) > 0$. Then the conditional density of $X$ given
$X \in S$, denoted by $f(x|S)$, is defined by

$$f(x|S) = \begin{cases}
\dfrac{f(x)}{P(X \in S)} & \text{for } x \in S \\
0 & \text{otherwise.}
\end{cases} \tag{3.50}$$


TABLE 3.8
Correlated Normal Conditional Probabilities
   X     P[X, Y = 2]    P[Y = 2]    P[X|Y = 2]     P[X]
   0       0.02632      0.36260      0.07259     0.07890
   1       0.08774      0.36260      0.24197     0.24057
   2       0.13529      0.36260      0.37311     0.36216
   3       0.08711      0.36260      0.24024     0.23962
   4       0.02343      0.36260      0.06462     0.06962
   5       0.00271      0.36260      0.00747     0.00916


To develop the conditional relationship between two continuous random
variables (i.e., $f(x|y)$) using the general approach to conditional density
functions presented in Definitions 3.20 and 3.21, we have to define the marginal
density (or marginal distribution) of continuous random variables.


Theorem 3.22. Let $f(x,y)$ be the joint density of $X$ and $Y$ and let $f(x)$ be
the marginal density of $X$. Then

$$f(x) = \int_{-\infty}^{\infty} f(x,y)\,dy. \tag{3.51}$$


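Going back to the distribution function from Example 3.11, we have

$$f(x,y) = xy e^{-(x+y)}. \tag{3.52}$$

To prove that this is a proper distribution function, we limit our consideration
to non-negative values of $x$ and $y$ (i.e., $f(x,y) \ge 0$ if $x, y \ge 0$).
From our previous discussion it is also obvious that

$$\begin{aligned}
\int_0^{\infty} \int_0^{\infty} xy e^{-(x+y)}\,dx\,dy
&= \left( \int_0^{\infty} x e^{-x}\,dx \right)
\left( \int_0^{\infty} y e^{-y}\,dy \right) \\
&= \left( \left. -x e^{-x} \right|_0^{\infty} + \int_0^{\infty} e^{-x}\,dx \right)
\left( \left. -y e^{-y} \right|_0^{\infty} + \int_0^{\infty} e^{-y}\,dy \right) \\
&= \left( -(0 - 0) - (0 - 1) \right) \left( -(0 - 0) - (0 - 1) \right) = 1.
\end{aligned} \tag{3.53}$$

Thus, this is a proper density function. The marginal density function for $x$
follows this formulation:

$$\begin{aligned}
f(x) &= \int_0^{\infty} f(x,y)\,dy = \left( x e^{-x} \right)
\int_0^{\infty} y e^{-y}\,dy \\
&= \left( x e^{-x} \right) \left( \left. -y e^{-y} \right|_0^{\infty}
+ \int_0^{\infty} e^{-y}\,dy \right) \\
&= x e^{-x}.
\end{aligned} \tag{3.54}$$


FIGURE 3.2
Quadratic Probability Density Function.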


Example 3.23. Consider the continuous bivariate distribution function

$$f(x,y) = \frac{3}{2} \left( x^2 + y^2 \right) \text{ for } x, y \in [0,1]
\tag{3.55}$$

which is depicted graphically in Figure 3.2. First, to confirm that
Equation 3.55 is a valid distribution function,

$$\frac{3}{2} \int_0^1 \int_0^1 \left( x^2 + y^2 \right) dx\,dy
= \frac{3}{2} \left( \left. \frac{1}{3} x^3 \right|_0^1
+ \left. \frac{1}{3} y^3 \right|_0^1 \right)
= \frac{3}{2} \cdot \frac{2}{3} = 1. \tag{3.56}$$

Further, $f(x,y) \ge 0$ for all $x, y \in [0,1]$. To prove this rigorously we
would show that $f(x,y)$ is at a minimum at $\{x,y\} = \{0,0\}$ and that the
derivative of $f(x,y)$ is non-negative for all $x, y \in [0,1]$.

This example has a characteristic that deserves discussion. Notice that
$f(1,1) = 3 > 1$; thus, while the axioms of probability require that
$f(x,y) \ge 0$, the function can assume almost any positive value as long as it
integrates to one. Departing from the distribution function in Equation 3.55
briefly, consider the distribution function $g(z) = 2$ for $z \in [0, 1/2]$.
This is a uniform distribution function with a narrower range than the
$U[0,1]$. It is valid because $g(z) \ge 0$ for all $z$ and

$$\int_0^{1/2} 2\,dz = 2 \left( \left. z \right|_0^{1/2} \right)
= 2 \left( \frac{1}{2} - 0 \right) = 1. \tag{3.57}$$

Hence, even though a density function has values greater than one, it may
still be a valid density function.


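Returning to the density function defined in Equation 3.55, we derive the
marginal density for $x$:

$$\begin{aligned}
f(x) &= \int_0^1 \frac{3}{2} \left( x^2 + y^2 \right) dy \\
&= \frac{3}{2} x^2 \int_0^1 dy + \frac{3}{2} \int_0^1 y^2\,dy \\
&= \frac{3}{2} x^2 + \frac{1}{2}.
\end{aligned} \tag{3.58}$$

While the result of Equation 3.58 should be a valid probability density function
by definition, it is useful to make sure that the result conforms to the axioms
of probability (e.g., it provides a check on your mathematics). First, we note
that $f(x) \ge 0$ for all $x \in [0,1]$. Technically, $f(x) = 0$ if
$x \notin [0,1]$. Next, to verify that the probability is one for the entire
sample set,

$$\begin{aligned}
\int_0^1 \left( \frac{3}{2} x^2 + \frac{1}{2} \right) dx
&= \frac{3}{2} \int_0^1 x^2\,dx + \frac{1}{2} \int_0^1 dx \\
&= \frac{3}{2} \left( \left. \frac{1}{3} x^3 \right|_0^1 \right)
+ \frac{1}{2} \left( \left. x \right|_0^1 \right) \\
&= \frac{1}{2} + \frac{1}{2} = 1.
\end{aligned} \tag{3.59}$$

Thus, the marginal distribution function from Equation 3.58 meets the criteria
for a probability measure.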
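
The marginal density can also be checked numerically. The sketch below takes the
joint density of Example 3.23 to be $\frac{3}{2}(x^2 + y^2)$ on the unit square,
which is consistent with $f(1,1) = 3$ and with the marginal derived above; the
midpoint-rule helper and its step count are assumptions of this illustration.

```python
def midpoint_integral(f, a, b, n=1_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def joint_density(x, y):
    """Joint density of Example 3.23 on the unit square (as read here)."""
    return 1.5 * (x ** 2 + y ** 2)


def marginal_x(x):
    """Integrate y out of the joint density numerically."""
    return midpoint_integral(lambda y: joint_density(x, y), 0.0, 1.0)


def closed_form(x):
    """The marginal density (3/2) x^2 + 1/2 from Equation 3.58."""
    return 1.5 * x ** 2 + 0.5


total_probability = midpoint_integral(marginal_x, 0.0, 1.0)
print(round(marginal_x(0.5), 4), closed_form(0.5), round(total_probability, 4))
```

Agreement between the numerical and closed-form marginals gives the same kind of
check on the mathematics that the text recommends.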


Next, we consider the bivariate extension of Definition 3.21. 


Definition 3.24. Let $(X,Y)$ have the joint density $f(x,y)$ and let $S$ be
a subset of the plane which has a shape as in Figure 3.3. We assume that
$P[(X,Y) \in S] > 0$. Then the conditional density of $X$ given $(X,Y) \in S$,
denoted $f(x|S)$, is defined by

$$f(x|S) = \begin{cases}
\dfrac{\int_{h(x)}^{g(x)} f(x,y)\,dy}{P[(X,Y) \in S]} & \text{for } a \le x \le b \\
0 & \text{otherwise,}
\end{cases} \tag{3.60}$$

where $h(x)$ and $g(x)$ denote the lower and upper boundaries of $S$ at $x$.


FIGURE 3.3
Conditional Distribution for a Region of a Bivariate Uniform Distribution.


Building on Definition 3.24, consider the conditional probability of $x < y$ for
the bivariate uniform distribution as depicted in Figure 3.3.


Example 3.25. Suppose $f(x,y) = 1$ for $0 \le x \le 1$, $0 \le y \le 1$, and 0
otherwise. Obtain $f(x|X < Y)$.

$$f(x|X < Y) = \frac{\int_x^1 dy}{\int_0^1 \int_x^1 dy\,dx} = 2(1 - x)
\text{ for } 0 \le x \le 1, \text{ and } 0 \text{ otherwise.} \tag{3.61}$$


FIGURE 3.4
Conditional Distribution of a Line for a Bivariate Uniform Distribution.


Notice the downward sloping nature of Equation 3.61 is consistent with the
area of the projection in the upper right diagram of Figure 3.3. Initially each
increment of $x$ implies a fairly large area of probability (i.e., the
difference $1 - x$). However, as $x$ increases, this area declines.
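
The conditional density $f(x|X < Y) = 2(1 - x)$ implied by Example 3.25 can be
illustrated by simulation. This is a sketch under stated assumptions: the seed
and the number of draws are arbitrary, and the two reference values follow from
integrating $2(1 - x)$ directly.

```python
import random

rng = random.Random(0)      # fixed seed so the sketch is repeatable
n_draws = 200_000

# Draw (x, y) uniformly on the unit square and keep x whenever x < y.
kept = []
for _ in range(n_draws):
    x, y = rng.random(), rng.random()
    if x < y:
        kept.append(x)

# Under f(x | X < Y) = 2(1 - x): P(X <= 0.5 | X < Y) = 0.75 and the
# conditional mean of X is 1/3.
share_below_half = sum(1 for x in kept if x <= 0.5) / len(kept)
conditional_mean = sum(kept) / len(kept)
print(round(share_below_half, 2), round(conditional_mean, 2))
```

Roughly half of the draws satisfy $x < y$, and the retained values of $x$ are
concentrated near zero, matching the downward slope of the density.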

Suppose that we are interested in the probability of $X$ along a linear
relationship $Y = y_1 + cX$. As a starting point, consider the simple bivariate
uniform distribution that we have been working with where $f(x,y) = 1$.
We are interested in the probability of the line in that space presented in
Figure 3.4. The conditional probability that $X$ falls into $[x_1, x_2]$ given
$Y = y_1 + cX$ is defined by

$$P(x_1 \le X \le x_2 | Y = y_1 + cX)
= \lim_{y_2 \to y_1} P(x_1 \le X \le x_2 | y_1 + cX \le Y \le y_2 + cX)
\tag{3.62}$$

for all $x_1$, $x_2$ satisfying $x_1 \le x_2$. Intuitively, as depicted in
Figure 3.5, we are going to start by bounding the line on which we want to
define the conditional probability (i.e., $Y = y_1 + cX \le Y = y^* + cX \le
Y = y_2 + cX$). Then we are going to reduce the bound $y_2 \to y_1$, leaving
the relationship for $y^*$. The conditional density of $X$ given
$Y = y_1 + cX$, denoted by $f(x|Y = y_1 + cX)$, if it exists, is defined to be a
function that satisfies

$$P(x_1 \le X \le x_2 | Y = y_1 + cX)
= \int_{x_1}^{x_2} f(x|Y = y_1 + cX)\,dx. \tag{3.63}$$


FIGURE 3.5
Bounding the Conditional Relationship.


In order to complete this proof we will need to use the mean value theorem
of integrals.


Theorem 3.26. Let $f(x)$ be a continuous function defined on the closed
interval $[a,b]$. Then there is some number $X$ in that interval
($a \le X \le b$) such that

$$\int_a^b f(x)\,dx = (b - a) f(X) \tag{3.64}$$

[48, p. 45].


The intuition for this theorem is demonstrated in Figure 3.6. We don’t know what
the value of $X$ is, but at least one $X$ satisfies the equality in
Equation 3.64.


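Theorem 3.27. The conditional density $f(x|Y = y_1 + cX)$ exists and is
given by

$$f(x|Y = y_1 + cX) = \frac{f(x, y_1 + cx)}
{\int_{-\infty}^{\infty} f(x, y_1 + cx)\,dx} \tag{3.65}$$

provided the denominator is positive.


FIGURE 3.6
Mean Value of Integral.


Proof. We have

$$\lim_{y_2 \to y_1} P(x_1 \le X \le x_2 | y_1 + cX \le Y \le y_2 + cX)
= \lim_{y_2 \to y_1}
\frac{\int_{x_1}^{x_2} \int_{y_1 + cx}^{y_2 + cx} f(x,y)\,dy\,dx}
{\int_{-\infty}^{\infty} \int_{y_1 + cx}^{y_2 + cx} f(x,y)\,dy\,dx}. \tag{3.66}$$

Thus, by the mean value of integration,

$$\frac{\int_{x_1}^{x_2} \int_{y_1 + cx}^{y_2 + cx} f(x,y)\,dy\,dx}
{\int_{-\infty}^{\infty} \int_{y_1 + cx}^{y_2 + cx} f(x,y)\,dy\,dx}
= \frac{\int_{x_1}^{x_2} f(x, y^* + cx)\,dx}
{\int_{-\infty}^{\infty} f(x, y^* + cx)\,dx} \tag{3.67}$$

where $y_1 \le y^* \le y_2$. As $y_2 \to y_1$, $y^* \to y_1$, hence

$$\lim_{y_2 \to y_1}
\frac{\int_{x_1}^{x_2} f(x, y^* + cx)\,dx}
{\int_{-\infty}^{\infty} f(x, y^* + cx)\,dx}
= \frac{\int_{x_1}^{x_2} f(x, y_1 + cx)\,dx}
{\int_{-\infty}^{\infty} f(x, y_1 + cx)\,dx}. \tag{3.68}$$

The transition between Equation 3.67 and Equation 3.68 starts by assuming
that for some

$$y^* \in [y_1, y_2] \Rightarrow \int_{y_1 + cx}^{y_2 + cx} f(x,y)\,dy
= (y_2 - y_1) f(x, y^* + cx). \tag{3.69}$$

The factor $(y_2 - y_1)$ appears in both the numerator and the denominator of
Equation 3.67 and cancels. Thus, if we take the limit such that $y_2 \to y_1$
and $y_1 \le y^* \le y_2$, then $y^* \to y_1$ and

$$f(x, y^* + cx) \to f(x, y_1 + cx). \tag{3.70}$$
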
Finally, we consider the conditional probability of $X$ given that $Y$ is
restricted to a single point.


Theorem 3.28. The conditional density of $X$ given $Y = y_1$, denoted by
$f(x|y_1)$, is given by

$$f(x|y_1) = \frac{f(x, y_1)}{f(y_1)}. \tag{3.71}$$

Note that a formal statement of Theorem 3.28 could follow Theorem 3.27,
applying the mean value of the integral to a range of $X$.

One would anticipate that continuous formulations of independence could
follow the discrete formulation such that we attempt to show that $f(x) =
f(x|y)$. However, independence for continuous random variables simply
relates to the separability of the joint distribution function.


Definition 3.29. Continuous random variables X and Y are said to be in- 
dependent if f (x,y) = f (x) f (y) for all x and y. 


Again returning to Example 3.11,

$$f(x,y) = xy\exp[-(x+y)] = (x\exp[-x])(y\exp[-y]). \tag{3.72}$$

Hence, $X$ and $Y$ are independent. In addition, the joint uniform
distribution function is independent because $f(x,y) = 1 = g(x)h(y)$ where
$g(x) = h(y) = 1$. This simplistic definition of independence can be easily
extended to $T$ random variables.


Definition 3.30. A finite set of continuous random variables $X, Y, Z, \cdots$
are said to be mutually independent if

$$f(x, y, z, \cdots) = g(x)h(y)i(z)\cdots. \tag{3.73}$$

A slightly more rigorous statement of independence for bivariate continuous
random variables is presented in Theorem 3.31.


Theorem 3.31. Let S be a subset of the plane such that $f(x, y) > 0$ over S
and $f(x, y) = 0$ outside of S. Then X and Y are independent if and only if
S is a rectangle (allowing $-\infty$ or $\infty$ to be an end point) with sides parallel
to the axes and $f(x, y) = g(x)\,h(y)$ over S, where $g(x)$ and $h(y)$ are some
functions of x and y, respectively. Note that $g(x) = c f(x)$ for some c,
$h(y) = c^{-1} f(y)$.


3.4 Cumulative Distribution Function


Another transformation of the density function is the cumulative distribution
function, which gives the probability that a random variable is less than or
equal to some specified value.


Definition 3.32. The cumulative distribution function of a random variable
X, denoted $F(x)$, is defined by

$$F(x) = P(X \le x) \tag{3.74}$$

for every real x. In the case of a discrete random variable

$$F(x_i) = \sum_{x_j \le x_i} P(x_j). \tag{3.75}$$

In the case of a continuous random variable

$$F(x) = \int_{-\infty}^{x} f(t)\,dt. \tag{3.76}$$

In the case of the uniform distribution

$$F(x) = \begin{cases} 0 & \text{if } x \le 0 \\ \int_0^x dt = x & \text{if } 0 < x < 1 \\ 1 & \text{if } x \ge 1. \end{cases} \tag{3.77}$$


The cumulative distribution function for the uniform distribution is presented 
in Figure 3.7. 

We will develop the normal distribution more fully over the next three
sections, but certain aspects of the normal distribution add to our current
discussion. The normal distribution is sometimes referred to as the bell curve
because its density function, presented in Figure 3.8, has a distinctive bell
shape. One of the vexing characteristics of the normal curve is that its
anti-derivative cannot be written in closed form using elementary functions.
What we know about the integral of the normal distribution we know because we
can integrate it over the range $(-\infty, \infty)$. Given that the anti-derivative of
the normal has no closed form, we typically rely on published tables or
numerical routines for finite integrals. The point is that Figure 3.9 presents
a numerically computed cumulative distribution for the normal density function.


3.5 Some Useful Distributions 


While there are an infinite number of continuous distributions, a small number 
of distributions account for most applications in econometrics.

FIGURE 3.7
Cumulative Distribution of the Uniform Distribution.

FIGURE 3.8
Normal Distribution Probability Density Function.

FIGURE 3.9
Normal Cumulative Distribution Function.


1. Uniform distribution: In our discussion we have made frequent use of
the U[0,1] distribution. A slightly more general form of this distribution
can be written as

$$f(x \mid a, b) = \begin{cases} \dfrac{1}{b - a} & \text{if } a \le x \le b \\ 0 & \text{otherwise.} \end{cases} \tag{3.78}$$


Apart from the fact that it is relatively easy to work with, the uniform
distribution is important for a wide variety of applications in econometrics
and applied statistics. Interestingly, one such application is sampling
theory. Given that the cumulative distribution function for any distribution
“maps” into the unit interval (i.e., $F(x) \to [0, 1]$), one way to develop
sample information involves drawing a uniform random variable ($z \sim U[0, 1]$,
read as z is distributed U[0,1]) and determining the value of x associated
with that probability (i.e., $x = F^{-1}(z)$ where $F^{-1}(z)$ is called the
inverse mapping function).
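
The inverse mapping idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the target distribution is taken to be an exponential with mean `lam` (chosen because its inverse cumulative distribution function has the closed form $F^{-1}(z) = -\lambda \ln(1 - z)$), and the function names are hypothetical rather than taken from the text.

```python
import math
import random


def inverse_cdf_exponential(z, lam):
    """Inverse CDF of an exponential with mean lam (an illustrative target)."""
    return -lam * math.log(1.0 - z)


def draw_samples(n, lam, seed=0):
    """Draw z ~ U[0, 1) and map each draw through x = F^{-1}(z)."""
    rng = random.Random(seed)
    return [inverse_cdf_exponential(rng.random(), lam) for _ in range(n)]


samples = draw_samples(100_000, lam=2.0)
sample_mean = sum(samples) / len(samples)
print(sample_mean)  # should be close to the assumed mean of 2.0
```

The same pattern applies to any distribution whose inverse cumulative distribution function can be evaluated, analytically or numerically.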


2. Gamma distribution: The gamma distribution has both pure statistical
uses and real applications to questions such as technical inefficiency
or crop insurance problems. Taking the statistical applications first, the
$\chi^2$ distribution is a form of a gamma distribution. The $\chi^2$ distribution
is important because if x follows a standard normal distribution (i.e., a
normal distribution with a mean of zero and a standard deviation of one),
then $x^2 \sim \chi^2_1$. Thus, variances tend to have a $\chi^2$ distribution. From a
technical inefficiency or crop yield perspective, the one-sided nature (i.e.,
the fact that $x > 0$) makes it useful. Essentially, we assume that every firm
is technically inefficient to some degree. Thus, we use the gamma distribution
by letting x be the level of technical inefficiency. Mathematically, the gamma
distribution can be written as

$$f(x \mid \alpha, \beta) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\,\beta^{\alpha}}\, x^{\alpha - 1} e^{-x/\beta} & \text{for } 0 < x < \infty,\ \alpha > 0,\ \beta > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{3.79}$$

Unfortunately, the gamma function $\Gamma(\alpha)$ is another numerical function
(e.g., like the cumulative distribution function presented in Section
3.4). It is defined as

$$\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t}\,dt. \tag{3.80}$$


3. Normal distribution: The normal distribution is to the econometrician
what a pair of pliers is to a mechanic. Its overall usefulness is related to the
central limit theorem, which essentially states that averages tend to be
normally distributed. Given that most estimators including ordinary least
squares are essentially weighted averages, most of our parameter estimates
tend to be normally distributed. In many cases throughout this textbook
we will rely on this distribution. Mathematically, this distribution can be
written as

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right] \quad \text{for all } -\infty < x < \infty \tag{3.81}$$

where $\mu$ is the mean and $\sigma^2$ is the variance.


4. Beta distribution: The bounds on the beta distribution make it useful
for estimation of the Bernoulli distribution. The Bernoulli distribution
is the standard formulation for two-outcome random variables like coin
tosses.

$$P(x) = p^{x} (1 - p)^{1 - x}, \quad x = 0, 1, \quad 0 \le p \le 1. \tag{3.82}$$

Given the bounds of the probability of a “heads,” several Bayesian estimators
of p often use the beta distribution, which is mathematically written

$$f(p \mid \alpha, \beta) = \begin{cases} \dfrac{1}{B(\alpha, \beta)}\, p^{\alpha - 1} (1 - p)^{\beta - 1} & \text{for } 0 < p < 1,\ \alpha > 0,\ \beta > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{3.83}$$

The beta function ($B(\alpha, \beta)$) is defined using the gamma function:

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}. \tag{3.84}$$


3.6 Change of Variables


Change of variables is a technique used to derive the distribution function 
for one random variable by transforming the distribution function of another 
random variable. 


Theorem 3.33. Let $f(x)$ be the density of X and let $Y = \phi(X)$, where $\phi$ is
a monotonic differentiable function. Then the density $g(y)$ of Y is given by

$$g(y) = f\left[\phi^{-1}(y)\right] \times \left|\frac{d\phi^{-1}(y)}{dy}\right|. \tag{3.85}$$


The term monotonic can be simplified to a “one-to-one” function. In other
words, each x is associated with one y over the range of a distribution
function. Hence, given an x, a single y is implied (e.g., the typical
definition of a function) and for any one y, a single x is implied.


Example 3.34. Suppose $f(x) = 1$ for $0 < x < 1$ and 0 otherwise. Assuming
$Y = X^2$, what is the distribution function $g(y)$ for Y? First, it is possible
to show that $Y = X^2$ is a monotonic or one-to-one mapping over the relevant
range. Given this one-to-one mapping, it is possible to derive the inverse
function:

$$\phi(x) = x^2 \Rightarrow \phi^{-1}(y) = \sqrt{y}. \tag{3.86}$$

Following the definition:

$$g(y) = 1 \times \frac{1}{2\sqrt{y}} = \frac{1}{2\sqrt{y}}. \tag{3.87}$$
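
A quick numerical check of the change of variables in Example 3.34 can be sketched as follows. It is a hedged illustration rather than part of the original derivation: it assumes the density $g(y) = 1/(2\sqrt{y})$ on $0 < y < 1$ implied by Theorem 3.33, and compares the probability of an arbitrarily chosen interval computed from that density with the share of simulated draws of $Y = X^2$ falling in the same interval. The helper names are hypothetical.

```python
import math
import random


def g(y):
    """Density of Y = X**2 for X ~ U[0, 1]: f[phi^{-1}(y)] * |d phi^{-1}(y)/dy|."""
    return 1.0 * (1.0 / (2.0 * math.sqrt(y)))


def integrate(func, lower, upper, n=10_000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    h = (upper - lower) / n
    return h * sum(func(lower + (i + 0.5) * h) for i in range(n))


# P(0.01 < Y <= 0.25) from the derived density ...
prob_from_density = integrate(g, 0.01, 0.25)

# ... and from simulated draws of Y = X**2.
rng = random.Random(42)
draws = [rng.random() ** 2 for _ in range(200_000)]
prob_from_draws = sum(0.01 < y <= 0.25 for y in draws) / len(draws)

# Both should be near sqrt(0.25) - sqrt(0.01) = 0.4.
print(prob_from_density, prob_from_draws)
```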


Extending the formulation in Theorem 3.33, we can envision a more complex
mapping that is the sum of individual one-to-one mappings.


Theorem 3.35. Suppose the inverse of $\phi(x)$ is multivalued and can be written
as

$$x_i = \psi_i(y), \quad i = 1, 2, \ldots, n_y. \tag{3.88}$$

Note that $n_y$ indicates the possibility that the number of values of x varies
with y. Then the density $g(y)$ of Y is given by

$$g(y) = \sum_{i=1}^{n_y} \frac{f\left[\psi_i(y)\right]}{\left|\phi'\left[\psi_i(y)\right]\right|} \tag{3.89}$$

where $f(\cdot)$ is the density of X and $\phi'$ is the derivative of $\phi$.

One implication of Theorem 3.35 is for systems of simultaneous equations.
Consider a very simplified demand system:


Periheae eles (3.90) 


where $y_1$ is the quantity supplied and demanded, $y_2$ is the price of the good,
$x_1$ is the price of an alternative good, and $x_2$ is consumer income. We develop
the matrix formulations in Chapter 10, but Equation 3.90 can be rewritten


ee 
YUu= 5 XY 52 
(3.91) 
2/18 28 
Yy2= 5 xy 5 rQ- 


The question is then: if we know something about the distribution of $x_1$ and
$x_2$, can we derive the distribution of $y_1$ and $y_2$? The answer of course is yes.


Theorem 3.36. Let $f(x_1, x_2)$ be the joint density of a bivariate random
variable $(X_1, X_2)$ and let $(Y_1, Y_2)$ be defined by a linear transformation

$$\begin{aligned} Y_1 &= a_{11} X_1 + a_{12} X_2 \\ Y_2 &= a_{21} X_1 + a_{22} X_2. \end{aligned} \tag{3.92}$$

Suppose $a_{11} a_{22} - a_{12} a_{21} \ne 0$ so that the equations can be solved for
$X_1$ and $X_2$ as

$$\begin{aligned} X_1 &= b_{11} Y_1 + b_{12} Y_2 \\ X_2 &= b_{21} Y_1 + b_{22} Y_2. \end{aligned} \tag{3.93}$$

Then the joint density $g(y_1, y_2)$ of $(Y_1, Y_2)$ is given by

$$g(y_1, y_2) = \frac{f(b_{11} y_1 + b_{12} y_2,\ b_{21} y_1 + b_{22} y_2)}{\left|a_{11} a_{22} - a_{12} a_{21}\right|} \tag{3.94}$$

where the support of g, that is, the range of $(y_1, y_2)$ over which g is positive,
must be appropriately determined.
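
As a hedged illustration of Theorem 3.36, the sketch below builds the transformed density for an assumed example: $X_1$ and $X_2$ independent U[0,1] with $Y_1 = X_1 + X_2$ and $Y_2 = X_1 - X_2$. The example, the function names, and the coefficient values are illustrative assumptions; only the formula for $g(y_1, y_2)$ comes from the theorem. The support is handled by letting the original density return zero outside the unit square.

```python
def make_transformed_density(f, a11, a12, a21, a22):
    """Return g(y1, y2) for Y = A X following the linear change of variables."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("a11*a22 - a12*a21 must be nonzero")
    # Coefficients of the inverse transformation X = B Y.
    b11, b12 = a22 / det, -a12 / det
    b21, b22 = -a21 / det, a11 / det

    def g(y1, y2):
        return f(b11 * y1 + b12 * y2, b21 * y1 + b22 * y2) / abs(det)

    return g


def uniform_square(x1, x2):
    """Joint density of two independent U[0, 1] variables (assumed example)."""
    return 1.0 if 0.0 <= x1 <= 1.0 and 0.0 <= x2 <= 1.0 else 0.0


# Y1 = X1 + X2 and Y2 = X1 - X2, so a11*a22 - a12*a21 = -2.
g = make_transformed_density(uniform_square, 1.0, 1.0, 1.0, -1.0)
print(g(1.0, 0.0))  # inside the support: x1 = x2 = 0.5
print(g(3.0, 0.0))  # outside the support: x1 = 1.5
```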


Theorem 3.36 is used to derive the Full Information Maximum Likeli- 
hood for systems with endogenous variables (i.e., variables with values that 
are determined inside the system like the quantity supplied and demanded, 
and the price that clears the market). Appendix B presents the maximum 
likelihood formulation for a system of equations. 


3.7 Derivation of the Normal Distribution Function 


As stated in Section 3.5, the normal distribution forms the basis for many 
problems in applied econometrics. However, the formula for the normal
distribution is abstract to say the least. As a starting point for
understanding the normal distribution, consider the change in variables
application presented in Example 3.37.


Example 3.37. Assume that we want to compute the probability of an event
that occurs within the unit circle given the standard bivariate uniform
distribution function ($f(x, y) = 1$ with $Y^2 < 1 - X^2$ given $0 < x, y < 1$). The
problem can be rewritten slightly: $Y < \sqrt{1 - X^2} \Rightarrow Y^2 + X^2 < 1$. Hence,
the problem can be written as

$$P(X^2 + Y^2 < 1) = \int_0^1 \left(\int_0^{\sqrt{1 - x^2}} dy\right) dx = \int_0^1 \sqrt{1 - x^2}\,dx. \tag{3.95}$$


As previously stated, we will solve this problem using integration by change
in variables. By trigonometric identity $1 = \sin^2(x) + \cos^2(x)$. Therefore,
$\sin^2(x) = 1 - \cos^2(x)$. The change in variables is then to let $x = \cos(t)$.
The integration by change in variables is then

$$\int_{x_1}^{x_2} f(x)\,dx = \int_{t_1}^{t_2} f\left[\phi(t)\right] \phi'(t)\,dt \tag{3.96}$$

such that $t_1 = \phi^{-1}(x_1)$ and $t_2 = \phi^{-1}(x_2)$. Thus, in explicit form our
transformation becomes

$$\int_{x_1}^{x_2} \sqrt{1 - x^2}\,dx = \int_{t_1}^{t_2} \sqrt{1 - \cos^2(t)}\ \frac{\partial \cos(t)}{\partial t}\,dt. \tag{3.97}$$

Given that

$$1 = \sin^2(t) + \cos^2(t) \text{ implies } \sin(t) = \sqrt{1 - \cos^2(t)} \quad \text{and} \quad \frac{\partial \cos(t)}{\partial t} = -\sin(t) \tag{3.98}$$

the transformed integral becomes

$$\int_{t_1}^{t_2} \sin(t) \times \left(-\sin(t)\right) dt = \int_{t_1}^{t_2} -\sin^2(t)\,dt. \tag{3.99}$$

To complete the transformation we derive the bounds of integration. However,
notice that $t_1 = \cos^{-1}(0) = \pi/2$ (or 90 degrees) while $t_2 = \cos^{-1}(1) = 0$,
implying

$$\int_{\pi/2}^{0} -\sin^2(t)\,dt \tag{3.100}$$

or that the order of the bounds of the integral is opposite from the standard
case. The solution is to reverse the order of integration:

$$\int_{\pi/2}^{0} -\sin^2(t)\,dt = \int_0^{\pi/2} \sin^2(t)\,dt. \tag{3.101}$$

The value of the integral is then

$$\int_0^{\pi/2} \sin^2(t)\,dt = \left[\frac{1}{2} t - \frac{1}{2} \sin(t) \cos(t)\right]_0^{\pi/2} = \frac{\pi}{4}. \tag{3.102}$$

FIGURE 3.10
Simple Quadratic Function.
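
The result of Example 3.37 can be checked numerically. The sketch below is a simple midpoint-rule approximation (an assumed numerical method, not something used in the text) of both the original integral in x and the transformed integral in t; each should be close to $\pi/4$.

```python
import math


def midpoint_integral(func, lower, upper, n=100_000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    h = (upper - lower) / n
    return h * sum(func(lower + (i + 0.5) * h) for i in range(n))


# Original form: area under sqrt(1 - x^2) on [0, 1].
area_x = midpoint_integral(lambda x: math.sqrt(1.0 - x * x), 0.0, 1.0)

# Transformed form: integral of sin^2(t) on [0, pi/2].
area_t = midpoint_integral(lambda t: math.sin(t) ** 2, 0.0, math.pi / 2.0)

print(area_x, area_t, math.pi / 4.0)
```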


While Example 3.37 appears simple enough, it opens the door to some very 
powerful tools of functional analysis. Specifically, while most transformations 
appear minor (i.e., taking the square root of a variable in Example 3.34 or a 
linear transformation of two variables in Theorem 3.36) more radical trans- 
formations of the variable space are possible. One such transformation is the 
polar functional form. 

Refer to the quadratic function

$$y = f(x) = 5 - \frac{x^2}{5}, \quad x \in [-5, 5] \tag{3.103}$$

depicted in Figure 3.10. Consider the point $f(4.0) = 1.8$; the length of the ray
from the origin to that point on the function can be computed as

$$r(4, 1.8) = \sqrt{4^2 + 1.8^2} \approx 4.39. \tag{3.104}$$

We can also compute the value of the inscribed angle. To do this we start by
noting that

$$\tan\left(\theta(4, 1.8)\right) = \frac{1.8}{4} = 0.45 \Rightarrow \theta(4, 1.8) = \tan^{-1}(0.45) = 0.13467, \tag{3.105}$$

where the angle is expressed as a fraction of $\pi$ (about 0.423 radians).


Repeating the process for $f(2) = 4.2$ yields $r(2, 4.2) = 4.65$ and
$\theta(2, 4.2) = 0.35857$. Applying this transformation at a sequence of points
between $x = -5$ and $x = 5$ yields the transformation presented in Figure 3.11.
Given that we can define a function $g(\theta, r(\theta))$ that defines these points,
we have an alternative representation of the simple quadratic function in
Equation 3.103. For example, we could approximate the function as

$$r(\theta) = 4.6584. \tag{3.106}$$

The approximation error could be reduced by adding additional terms (i.e.,
$r(\theta) = a + b\theta + c\theta^2$; see the discussion in Appendix C).

FIGURE 3.11
Polar Transformation of Simple Quadratic.
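
The polar transformation of the quadratic can be reproduced with a short sketch. It assumes the function $f(x) = 5 - x^2/5$ and measures the angle from the horizontal axis with `math.atan2`, reporting it both in radians and as a fraction of $\pi$; the function names are hypothetical.

```python
import math


def f(x):
    """Simple quadratic function on [-5, 5]."""
    return 5.0 - x ** 2 / 5.0


def to_polar(x):
    """Return (r, theta) for the point (x, f(x)), with theta in radians."""
    y = f(x)
    r = math.hypot(x, y)
    theta = math.atan2(y, x)
    return r, theta


for x in (4.0, 2.0):
    r, theta = to_polar(x)
    # Columns: x, ray length, angle in radians, angle as a fraction of pi.
    print(x, round(r, 2), round(theta, 4), round(theta / math.pi, 4))
```

Evaluating `to_polar` on a grid of x values between -5 and 5 traces out the kind of curve described for Figure 3.11.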

Intuitively, we could integrate the simple quadratic following Example 3.37, 
but it is obvious that such an integration would be more trouble than it is 
worth. However, the polar transformation simplifies some complex integrals 
such as the normal density function. 

To develop the normal distribution, we start with the standard normal
(i.e., $x \sim N(0, 1)$), which can be written as

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}. \tag{3.107}$$

First, we need to demonstrate that the distribution function does integrate
to one over the entire sample space, which is $-\infty$ to $\infty$. This is typically
accomplished by proving the constant. Let us start by assuming that

$$I = \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}}\,dy. \tag{3.108}$$

Squaring this expression yields

$$I^2 = \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}}\,dy \int_{-\infty}^{\infty} e^{-\frac{x^2}{2}}\,dx = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\frac{y^2 + x^2}{2}}\,dy\,dx. \tag{3.109}$$

The trick to this integration is changing the variables into a polar form.
Following the preceding discussion,

$$\begin{aligned} r &= \sqrt{x^2 + y^2} \\ \theta &= \tan^{-1}(x/y) \\ y &= r \cos(\theta) \\ x &= r \sin(\theta). \end{aligned} \tag{3.110}$$

We apply the change in variable technique to change the integral into the
polar space. First, we transform the variables of integration

$$dy\,dx = r\,dr\,d\theta. \tag{3.111}$$

Folding these two results together we get

$$I^2 = \int_0^{2\pi} \int_0^{\infty} r e^{-\frac{r^2}{2}}\,dr\,d\theta = \int_0^{2\pi} d\theta = 2\pi. \tag{3.112}$$


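A couple of points about the result in Equation 3.112; first note that

$$\frac{\partial e^{-\frac{r^2}{2}}}{\partial r} = -r e^{-\frac{r^2}{2}} \Rightarrow \int_0^{\infty} r e^{-\frac{r^2}{2}}\,dr = \left[-e^{-\frac{r^2}{2}}\right]_0^{\infty} = 1. \tag{3.113}$$

Second, the distance function is non-negative by definition (i.e.,
$\sqrt{x^2 + y^2} \ge 0$). Hence, the range of the inner integral in Equation 3.112
is $r \in [0, \infty)$. Taking the square root of each side yields

$$I = \sqrt{2\pi}. \tag{3.114}$$

Thus, we know that

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}}\,dy = 1. \tag{3.115}$$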
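
The constant can also be confirmed numerically. The following sketch approximates the integral of $e^{-y^2/2}$ and the inner polar integral of $r e^{-r^2/2}$ with a midpoint rule over a truncated range; the truncation at 10 and the step count are assumptions made only for this illustration.

```python
import math


def midpoint_integral(func, lower, upper, n=20_000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    h = (upper - lower) / n
    return h * sum(func(lower + (i + 0.5) * h) for i in range(n))


# Integral of exp(-y^2 / 2); the tails beyond |y| = 10 are negligible.
gauss_integral = midpoint_integral(lambda y: math.exp(-y * y / 2.0), -10.0, 10.0)

# Inner polar integral of r * exp(-r^2 / 2) over [0, infinity), truncated at 10.
inner_integral = midpoint_integral(lambda r: r * math.exp(-r * r / 2.0), 0.0, 10.0)

print(gauss_integral, math.sqrt(2.0 * math.pi))
print(gauss_integral ** 2, 2.0 * math.pi * inner_integral)
```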


The expression in Equation 3.115 is referred to as the standard normal.
A more general form of the normal distribution function can be derived by
defining a transformation function. Defining

$$y = a + bx \Rightarrow x = \frac{y - a}{b} \tag{3.116}$$

by the change in variable technique, we have

$$f(y) = \frac{1}{b\sqrt{2\pi}} \exp\left[-\frac{(y - a)^2}{2b^2}\right], \quad b > 0. \tag{3.117}$$

As presented in Section 3.5 we typically denote a as $\mu$ (i.e., the mean) and b
as $\sigma$ (i.e., the standard deviation, so that $b^2 = \sigma^2$ is the variance).


3.8 An Applied Sabbatical 


In the past, farmers received assistance during disasters (i.e., drought or 
floods) through access to concessionary credit. Increasingly, during the last 
10 years of the 20th century, agricultural policy in the United States shifted 
toward market-based crop insurance. This insurance was supposed to be ac- 
tuarially sound so that producers would make decisions that were consistent 
with maximizing economic surplus. Following the discussion of [36], the loss 
of a crop insurance event could be parameterized as 


$$L = AC - AR \tag{3.118}$$

where C is the level of coverage (i.e., the number of bushels guaranteed under
the insurance policy, typically 10, 20, or 40 percent of some expected level of
yield), A is the probability that the yield falls below that level, R is the
expected value of the yield given that an insured event has occurred, and L is
the insurance indemnity or actuarially fair value of the insurance. Given
these definitions, the insurance indemnity becomes

$$L = \int_{-\infty}^{C} (C - y)\,dF(y). \tag{3.119}$$
This loss is in yield space; it ignores the price of the output. Apart from the 
question of prices, a critical part of the puzzle is the distribution function 


dF (y) = f (y) dy. (3.120) 
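
To make the indemnity calculation concrete, the sketch below evaluates the integral of $(C - y) f(y)$ numerically and compares it with the decomposition $AC - AR$. Everything specific here is a labeled assumption: yields are taken to be normal with a hypothetical mean of 45 and standard deviation of 8, the coverage level is a hypothetical 36, and the lower limit is truncated ten standard deviations below the mean.

```python
import math

# Hypothetical values chosen only for illustration.
MEAN_YIELD, SD_YIELD, COVERAGE = 45.0, 8.0, 36.0
LOWER = MEAN_YIELD - 10.0 * SD_YIELD  # stands in for minus infinity


def pdf(y):
    """Assumed normal yield density f(y)."""
    z = (y - MEAN_YIELD) / SD_YIELD
    return math.exp(-0.5 * z * z) / (SD_YIELD * math.sqrt(2.0 * math.pi))


def midpoint_integral(func, lower, upper, n=20_000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    h = (upper - lower) / n
    return h * sum(func(lower + (i + 0.5) * h) for i in range(n))


# A: probability of a loss event; R: expected yield given a loss event.
prob_loss = midpoint_integral(pdf, LOWER, COVERAGE)
cond_yield = midpoint_integral(lambda y: y * pdf(y), LOWER, COVERAGE) / prob_loss

indemnity_direct = midpoint_integral(lambda y: (COVERAGE - y) * pdf(y), LOWER, COVERAGE)
indemnity_decomposed = prob_loss * COVERAGE - prob_loss * cond_yield

print(prob_loss, cond_yield, indemnity_direct, indemnity_decomposed)
```

Swapping in a different `pdf` changes the indemnity, which is the sense in which the choice of distribution function matters for the premium.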


Differences in the functional form of the distribution function imply
different insurance premiums for producers. The goal of the selection of a
distribution function is for the distribution function to match the actual
distribution function of crop yields. Differences between the actual
distribution function and the empirical form used to estimate the premium
lead to an economic loss. If a distribution systematically understates the
probability of lower return, farmers could make an arbitrage gain by buying
crop insurance. If a distribution systematically overstates the probability of
a lower return, farmers would not buy the insurance (it is not a viable
instrument).

The divergence between the relative probabilities is a function of the
flexibility of the distribution’s moments (developed more fully in Chapter 4).

• Expected value: First moment

$$\mu_1 = \int_{-\infty}^{\infty} x f(x)\,dx. \tag{3.121}$$

• Variance: Second central moment

$$\mu_2^c = \int_{-\infty}^{\infty} (x - \mu_1)^2 f(x)\,dx. \tag{3.122}$$

• Skewness: Third central moment

$$\mu_3^c = \int_{-\infty}^{\infty} (x - \mu_1)^3 f(x)\,dx. \tag{3.123}$$

• Kurtosis: Fourth central moment

$$\mu_4^c = \int_{-\infty}^{\infty} (x - \mu_1)^4 f(x)\,dx. \tag{3.124}$$
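
The four moment integrals can be approximated numerically for any density. The sketch below does so for a normal density with a hypothetical mean of 2 and standard deviation of 1.5 (values assumed for illustration), using a midpoint rule over twelve standard deviations on either side of the mean. For a normal density the third central moment should be near zero and the fourth should be near three times the squared variance.

```python
import math

MEAN, SD = 2.0, 1.5  # hypothetical parameters for the illustration


def pdf(x):
    """Normal density with the assumed mean and standard deviation."""
    z = (x - MEAN) / SD
    return math.exp(-0.5 * z * z) / (SD * math.sqrt(2.0 * math.pi))


def integrate(func, n=40_000):
    """Midpoint rule over [MEAN - 12*SD, MEAN + 12*SD]."""
    lower, upper = MEAN - 12.0 * SD, MEAN + 12.0 * SD
    h = (upper - lower) / n
    return h * sum(func(lower + (i + 0.5) * h) for i in range(n))


first_moment = integrate(lambda x: x * pdf(x))


def central_moment(k):
    """k-th central moment around the numerically computed first moment."""
    return integrate(lambda x: (x - first_moment) ** k * pdf(x))


variance = central_moment(2)
third = central_moment(3)
fourth = central_moment(4)
print(first_moment, variance, third, fourth, 3.0 * variance ** 2)
```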


Each distribution implies a certain level of flexibility between the moments. 
For the normal distribution, all odd central moments are equal to zero, which 
implies that the distribution function is symmetric. In addition, all even mo- 
ments are a function of the second central moment (i.e., the variance). Moss 
and Shonkwiler [33] propose a distribution function that has greater flexibility 
based on the normal (specifically in the third and fourth moments). This new 
distribution (presented in Figure 3.12) is accomplished by parametric trans- 
formation to normality. The distribution is called an inverse hyperbolic sine 


transformation: P i 
ln @ + (eee) + 1| ) 


0 


&=%4—0. 


(3.125) 


ee = 


Norwood, Roberts, and Lusk [37] evaluate the goodness of yield distributions
using a variant of Kullback’s [28] information criteria:

$$I = \int f(X \mid \theta_f) \ln\left(\frac{f(X \mid \theta_f)}{g(X \mid \theta_g)}\right) dX. \tag{3.126}$$


Like most informational indices, this index reaches a minimum of zero if the 
two distribution functions are identical everywhere. Otherwise, a positive num- 
ber reflects the magnitude of the divergence. The Norwood, Roberts, and Lusk 
[37] model then suggests that a variety of models can be tested against each 
other by comparing their out-of-sample measure. This measure is actually
constructed by letting the probability of an out-of-sample forecast equal 1/N
where N is the number of out-of-sample draws.

FIGURE 3.12
Inverse Hyperbolic Sine Transformation of the Normal Distribution.


= (3.127) 
N 
= 5 (y) wel) 
ae? ha 
siI«l= — Hy Dm (9 (X89) 


Ignoring the constants, the measure of goodness becomes negative. The 
more negative the number, the less good is the distributional fit. Norwood, 
Roberts, and Lusk then construct a number of out-of-sample measures of I. 


3.9 Chapter Summary 


• Random Variables are real numbers whose outcomes are elements of a
sample space following a probability function.

  - Discrete Random Variables take on a finite number of values.

  - Continuous Random Variables can take on an infinite number of
    values. Hence, the probability of any one number occurring is zero.

• In econometrics, we are typically interested in a continuous distribution
function ($f(x)$) defined on a set of real numbers (i.e., a subset of the
real number line; $f(x)$ is defined on $x \in [x_0, x_1]$). The properties of the
distribution function then become:

  - $f(x) \ge 0$ with $f(x) > 0$ for $x \in [x_0, x_1]$.

  - $\int_{x_0}^{x_1} f(x)\,dx = 1$.

  - Additivity is typically guaranteed by the definition of a single valued
    function on the real number line.

• Measure theory allows for a more general specification of probability
functions. However, in econometrics we typically limit our considerations to
simpler specifications defining random variables as subsets of the real
number space.

• If two (or more) random variables are independent, their joint probability
density function can be factored $f(x, y) = f_1(x)\,f_2(y)$.

• The conditional relationship between two continuous random variables can
be written as

$$f(x \mid y)\Big|_{y = y_0} = \frac{f(x, y_0)}{\int_{-\infty}^{\infty} f(x, y_0)\,dx}. \tag{3.128}$$


3.10 Review Questions 


3-1R. Demonstrate that the outcomes of dice rolls meet the criteria for a
Borel set.

3-2R. Construct the probability of damage given the outcome of two dice, one
with eight sides and one with six sides. Assume that damage occurs
when the six-sided die is 5 or greater, while the amount of damage
is given by the outcome of the eight-sided die. How many times does
damage occur given the outcomes from Table 3.2? What is the level
of that damage? Explain the outcome of damage using a set-theoretic
mapping.

3-3R. Explain why the condition that $f(x, y) = f_1(x)\,f_2(y)$ is the same
relationship for independence as the condition that the marginal
distribution for x is equal to its conditional probability given y.


TABLE 3.9
Discrete Distribution Functions for Bivariate Random Variables

Outcome            Outcome for X
for Y           1        2        3
Density 1
   1          0.075    0.150    0.075
   2          0.100    0.200    0.100
   3          0.075    0.150    0.075
Density 2
   1          0.109    0.130    0.065
   2          0.087    0.217    0.087
   3          0.065    0.130    0.109


3.11 Numerical Exercises


3-1E. What is the probability of rolling a number less than 5 given that two
six-sided dice are rolled?

3-2E. Is the function

$$f(x) = \frac{3}{100}\left(5 - \frac{x^2}{5}\right) \tag{3.129}$$

a valid probability density function for $x \in [-5, 5]$?

3-3E. Derive the cumulative distribution function for the probability density
function in Equation 3.129.

3-4E. Is the function
€ [0, 1] 


x 

aes 

f(x)= x € [2,4] (3.130) 
0 


otherwise 
a valid probability density function?

3-5E. Given the probability density function in Equation 3.129, what is the
probability that $x < -0.75$?

3-6E. Given the probability density function in Equation 3.130, what is the
probability that $0.5 < x < 3$?

3-7E. Are the density functions presented in Table 3.9 independent?


FIGURE 3.13
Continuous Joint Distribution.


3-8E. Consider the joint probability density function

$$f(x, y) = 0.9144\,xy \exp\left(-x - y + 0.05\sqrt{xy}\right) \tag{3.131}$$

as depicted in Figure 3.13. Are x and y independent?


3-9E. Consider the joint distribution function 


f@ay ‘ for x € {0,2} y € (0,2). (3.132) 


Suppose that we transform the distribution by letting $z = \sqrt{x}$. Derive
the new distribution function and plot the new distribution function
over the new range.


3-10E. For Equation 3.132 compute the conditional probability density func- 
tion for x given that y > x. 


Given our discussion of random variables and their distributions in Chapter 3,
we can now start to define statistics or functions that summarize the
information for a particular random variable. As a starting point, Chapter 4
develops the definition of the moments of the distribution. Moment is a
general term for the expected kth power of the distribution

$$E\left[x^k\right] = \int_{-\infty}^{\infty} x^k f(x)\,dx. \tag{4.1}$$

The first moment of a distribution (i.e., $k = 1$) is typically referred to as the
mean of the distribution. Further, the variance of a distribution can be derived
using the mean and the second moment of the distribution. As a starting point,
we need to develop the concept of an expectation rigorously.


4.1 Expected Values 


In order to avoid circularity, let us start by defining the expectation of a 
function (g (X)) of a random variable X in Definition 4.1. 


Definition 4.1. The expected value or mean of a random variable $g(X)$,
denoted $E[g(X)]$, is

$$E[g(X)] = \begin{cases} \int_{-\infty}^{\infty} g(x) f(x)\,dx & \text{if } X \text{ is continuous} \\ \sum_{x \in X} g(x) f(x) & \text{if } X \text{ is discrete.} \end{cases} \tag{4.2}$$

Hence, the expectation is a weighted sum: the result of the function is
weighted by the distribution or density function. To demonstrate this
concept, consider the mean of the distribution (i.e., $g(X) = X$):

$$E[x] = \int_{-\infty}^{\infty} x f(x)\,dx. \tag{4.3}$$


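For example, derive the mean of the exponential distribution

$$\begin{aligned} E[x] &= \int_0^{\infty} x \frac{1}{\lambda} e^{-\frac{x}{\lambda}}\,dx \\ &= -\left(x e^{-\frac{x}{\lambda}}\Big|_0^{\infty}\right) + \int_0^{\infty} e^{-\frac{x}{\lambda}}\,dx \\ &= -\left(\lambda e^{-\frac{x}{\lambda}}\Big|_0^{\infty}\right) = \lambda. \end{aligned} \tag{4.4}$$

The exponential distribution function can be used to model the probability
of first arrival. Thus, $f(x = 1)$ would be the probability that the first arrival
will occur in one hour.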
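
The integration by parts can be cross-checked numerically. The sketch below approximates the mean of an exponential density with a midpoint rule; the value of `lam` and the truncation of the upper limit at fifty times `lam` are assumptions made for the illustration.

```python
import math


def exponential_pdf(x, lam):
    """Exponential density (1 / lam) * exp(-x / lam) for x >= 0."""
    return math.exp(-x / lam) / lam


def numeric_mean(lam, n=200_000):
    """Midpoint-rule approximation of the integral of x * f(x) over [0, 50 * lam]."""
    upper = 50.0 * lam
    h = upper / n
    return h * sum((i + 0.5) * h * exponential_pdf((i + 0.5) * h, lam) for i in range(n))


print(numeric_mean(3.0))  # should be close to lam = 3.0
```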

Given the definition of the expectation from Definition 4.1, the properties 
of the expectation presented in Theorem 4.2 follow. 


Theorem 4.2. Let X be a random variable and let a, b, and c be constants.
Then for any functions $g_1(X)$ and $g_2(X)$ whose expectations exist:

1. $E[a g_1(X) + b g_2(X) + c] = a E[g_1(X)] + b E[g_2(X)] + c$.

2. If $g_1(X) \ge 0$ for all X, then $E[g_1(X)] \ge 0$.

3. If $g_1(X) \ge g_2(X)$ for all X, then $E[g_1(X)] \ge E[g_2(X)]$.

4. If $a \le g_1(X) \le b$ for all X, then $a \le E[g_1(X)] \le b$.


Result 1 follows from the linearity of the expectation. Result 2 can be further
strengthened in Jensen’s inequality (i.e., $E[g(x)] \ge g[E[x]]$ if
$\partial^2 g(x) / \partial x^2 \ge 0$ for all x). Result 3 actually follows from results 1
and 2 (i.e., if $a = 1$ and $b = -1$ then
$g_1(x) - g_2(x) \ge 0 \Rightarrow E[g_1(x)] - E[g_2(x)] \ge 0$).

A critical concept in computing expectations is whether either the sum (in
the case of discrete random variables) or the integral (in the case of continuous
random variables) is bounded. To introduce the concept of boundedness,
consider the expectation of a discrete random variable presented in
Definition 4.3. In this discussion we denote the sum of the positive values of
$x_i$ as $\sum_{+}$ and the sum over negative values as $\sum_{-}$.


TABLE 4.1
Expected Value of a Single-Die Roll

Number    Probability    x_i P(x_i)
  1          0.167          0.167
  2          0.167          0.333
  3          0.167          0.500
  4          0.167          0.667
  5          0.167          0.833
  6          0.167          1.000
Total                       3.500


Definition 4.3. Let X be a discrete random variable taking the value $x_i$
with probability $P(x_i)$, $i = 1, 2, \cdots$. Then the expected value (expectation
or mean) of X, denoted $E[X]$, is defined to be $E[X] = \sum_{i=1}^{\infty} x_i P(x_i)$ if the
series converges absolutely.

1. We can write $E[X] = \sum_{+} x_i P(x_i) + \sum_{-} x_i P(x_i)$ where in the first
summation we sum for i such that $x_i > 0$ and in the second summation we
sum for i such that $x_i < 0$.

2. If $\sum_{+} x_i P(x_i) = \infty$ and $\sum_{-} x_i P(x_i) = -\infty$ then $E[X]$ does not exist.

3. If $\sum_{+} x_i P(x_i) = \infty$ and $\sum_{-} x_i P(x_i)$ is finite then we say $E[X] = \infty$.

4. If $\sum_{-} x_i P(x_i) = -\infty$ and $\sum_{+} x_i P(x_i)$ is finite then we say that
$E[X] = -\infty$.


Thus the second result states that $-\infty + \infty$ does not exist. Results 3 and 4
imply that if either the negative or positive sum is finite, the expected value
is determined by $-\infty$ or $\infty$, respectively.

Consider a couple of aleatory or gaming examples. 


Example 4.4. Given that each face of the die is equally likely, what is the
expected value of the roll of the die? As presented in Table 4.1, the expected
value of a single die roll with values $x_i = \{1, 2, 3, 4, 5, 6\}$ and each value
being equally likely is 3.50.


An interesting aspect of Example 4.4 is that the expected value is not part of 
the possible outcomes — it is impossible to roll a 3.5. 


Example 4.5. What is the expected value of a two-die roll? This time the
values of $x_i$ are the integers between 2 and 12 and the probabilities are no
longer equal (as depicted in Table 4.2). This time the expected value is 7.00.
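
The two dice expectations can be reproduced exactly with rational arithmetic. The sketch below is an illustration using Python's `fractions` module; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)


def expected_single_die():
    """Sum of x_i * P(x_i) with P(x_i) = 1/6 for each face."""
    return sum(Fraction(face, 6) for face in FACES)


def expected_two_dice():
    """Sum over the 36 equally likely pairs of (a + b) * 1/36."""
    return sum(Fraction(a + b, 36) for a, b in product(FACES, repeat=2))


def two_dice_distribution():
    """Probability of each total from 2 to 12."""
    dist = {}
    for a, b in product(FACES, repeat=2):
        dist[a + b] = dist.get(a + b, 0) + Fraction(1, 36)
    return dist


print(expected_single_die(), expected_two_dice())
print(two_dice_distribution())
```

The distribution returned by `two_dice_distribution` shows why the probabilities are no longer equal: a total of 7 can be formed in six ways, while 2 and 12 can each be formed in only one.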


The result in Example 4.5 explains many of the dice games from casinos.
Turning from the simple game expectations, consider an application from
risk theory.


TABLE 4.2
Expected Value of a Two-Die Roll

Die 1  Die 2  Number  x_i P(x_i)    Die 1  Die 2  Number  x_i P(x_i)
  1      1      2       0.056         1      4      5       0.139
  2      1      3       0.083         2      4      6       0.167
  3      1      4       0.111         3      4      7       0.194
  4      1      5       0.139         4      4      8       0.222
  5      1      6       0.167         5      4      9       0.250
  6      1      7       0.194         6      4     10       0.278
  1      2      3       0.083         1      5      6       0.167
  2      2      4       0.111         2      5      7       0.194
  3      2      5       0.139         3      5      8       0.222
  4      2      6       0.167         4      5      9       0.250
  5      2      7       0.194         5      5     10       0.278
  6      2      8       0.222         6      5     11       0.306
  1      3      4       0.111         1      6      7       0.194
  2      3      5       0.139         2      6      8       0.222
  3      3      6       0.167         3      6      9       0.250
  4      3      7       0.194         4      6     10       0.278
  5      3      8       0.222         5      6     11       0.306
  6      3      9       0.250         6      6     12       0.333
Total                                                       7.000


Example 4.6. Expectation has several applications in risk theory. In general, 
the expected value is the value we expect to occur. For example, if we assume 
that the crop yield follows a binomial distribution, as depicted in Figure 4.1, 
the expected return on the crop given that the price is $3 and the cost per 
acre is $40 becomes $95 per acre, as demonstrated in Table 4.3. 


Notice that the expectation in Example 4.6 involves taking the expectation of 
a more general function (i.e., not simply taking the expected value of k = 1). 


$$E[p_X X - C] = \sum_{i} \left[p_X x_i - C\right] P(x_i) \tag{4.5}$$

or the expected value of profit.

In the parlance of risk theory, the expected value of profit for the wheat 
crop is termed the actuarial value or fair value of the game. It is the value 
that a risk neutral individual would be willing to pay for the bet [82]. 
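
The expected profit calculation in Example 4.6 can be sketched in code. The price of 3 and the cost of 40 come from the example; the binomial parameters and the mapping from binomial outcomes to yields (10 trials, success probability 0.6, yield equal to 15 plus 5 per success) are assumptions inferred for this illustration because they appear to reproduce the probabilities reported in Table 4.3. They should not be read as the author's stated specification.

```python
import math

PRICE, COST = 3.0, 40.0         # from Example 4.6
N_TRIALS, P_SUCCESS = 10, 0.6   # assumed binomial parameters
BASE_YIELD, STEP = 15.0, 5.0    # assumed mapping from successes to yield


def yield_distribution():
    """Return (yield, probability) pairs under the assumed binomial model."""
    pairs = []
    for k in range(N_TRIALS + 1):
        prob = math.comb(N_TRIALS, k) * P_SUCCESS ** k * (1.0 - P_SUCCESS) ** (N_TRIALS - k)
        pairs.append((BASE_YIELD + STEP * k, prob))
    return pairs


distribution = yield_distribution()
expected_yield = sum(y * p for y, p in distribution)
expected_profit = sum((PRICE * y - COST) * p for y, p in distribution)
print(expected_yield, expected_profit)
```

Because profit is a linear function of yield, the expected profit equals the price times the expected yield minus the cost, which is Result 1 of Theorem 4.2 at work.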

Another point about this value is that it is sometimes called the population
mean as opposed to the sample mean. Specifically, the sample mean is
an observed quantity based on a sample drawn from the random generating
function. The sample mean is defined as

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i. \tag{4.6}$$

FIGURE 4.1
Wheat Yield Density Function.


Table 4.4 presents a sample of 20 observations drawn from the theoretical
distribution above. Note that the sample mean for yield is smaller than the
population mean (38.75 for the sample mean versus 45.00 for the population
mean). It follows that the sample mean for profit is smaller than the population
mean for profit.

Another insight from the expected value and gambling is the Saint Peters- 
burg paradox. The Saint Petersburg paradox involves the valuation of gambles 
with an infinite value. The simplest form of the paradox involves the value of 
a series of coin flips. Specifically, what is the expected value of a bet that pays 
off $2 if the first toss is a head and 2 times that amount for each subsequent 


TABLE 4.3 
Expected Return on an Acre of Wheat 


15 0.0001 0.0016 0.0005 
20 0.0016 0.0315 0.0315 
25 0.0106 0.2654 0.3716 
30 0.0425 1.2740 2.1234 
35 0.1115 3.9017 7.2460 
40 0.2007 8.0263 16.0526 
45 0.2508 11.2870 23.8282 
50 0.2150 10.7495 23.6490 
55 0.1209 6.6513 15.1165 
60 0.0403 2.4186 5.6435 
65 0.0060 0.3930 0.9372 


Total 45.0000 95.0000 
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TABLE 4.4 
Sample of Yields and Profits 


Observation Yield Profit 


1 40 80 
2 40 80 
3 40 80 
4 50 110 
5 50 110 
6 45 95 
7 35 65 
8 25 35 
9 40 80 
10 50 110 
11 30 50 
12 35 65 
13 40 80 
14 25 35 
15 45 95 
16 35 65 
17 35 65 
18 40 80 
19 30 50 
20 45 95 


Mean 38.75 76.25 


head? If the series of coin flips is HHHT, the payoff is $8 (i.e., 2 x 2 x 2 or 
23). In theory, the expected value of this bet is infinity, but no one is willing 
to pay an infinite price. 


E[G] = > 22 = > Ise, (4.7) 


This unwillingness to pay an infinite price for the gamble led to expected 
utility theory. 
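The divergence in Equation 4.7 can be seen by truncating the sum. The following Python sketch is an illustration only; the function name `truncated_expected_value` is hypothetical. Each retained term pays $2^i$ and carries weight $2^{-i}$, so the truncated expected value simply equals the number of terms kept.

```python
from fractions import Fraction


def truncated_expected_value(max_terms):
    """Sum of the first `max_terms` terms of Equation 4.7, in exact arithmetic."""
    # Term i pays 2**i and is weighted by 2**(-i), so it contributes exactly 1.
    return sum(Fraction(2) ** i * Fraction(1, 2) ** i
               for i in range(1, max_terms + 1))


for n in (1, 10, 100):
    print(n, truncated_expected_value(n))
```

The truncated value grows one-for-one with the number of terms, which is why no finite price matches the expected value of the untruncated bet.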

Turning from discrete to continuous random variables, Definition 4.7 
presents similar results for continuous random variables as developed for dis- 
crete random variables in Definition 4.3. 


Definition 4.7. Let X be a continuous random variable with density $f(x)$.
Then, the expected value of X, denoted E[X], is defined to be
$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$ if the integral is absolutely convergent.

1. If $\int_{0}^{\infty} x f(x)\,dx = \infty$ and $\int_{-\infty}^{0} x f(x)\,dx = -\infty$, we say that the
expectation does not exist.

2. If $\int_{0}^{\infty} x f(x)\,dx = \infty$ and $\int_{-\infty}^{0} x f(x)\,dx$ is finite, then $E[X] = \infty$.

3. If $\int_{-\infty}^{0} x f(x)\,dx = -\infty$ and $\int_{0}^{\infty} x f(x)\,dx$ is finite, then we write
$E[X] = -\infty$.


FIGURE 4.2
Standard Normal and Cauchy Distributions. (Plot not reproduced; it shows the normal density f(x) and the Cauchy density g(x).)


As in our discussion of the boundedness of the mean in the discrete probabil- 
ity distributions, Definition 4.7 provides some basic conditions to determine 
whether the continuous expectations are bounded. To develop the point, con- 
sider the integral over the positive tails of two distribution functions: the 
standard normal and Cauchy distribution functions. The mathematical form 
of the standard normal distribution is given by Equation 3.107, while the 
Cauchy distribution can be written as 


$$ g(x) = \frac{1}{\pi \left(1 + x^2\right)}. \tag{4.8} $$


Figure 4.2 presents the two distributions; the solid line depicts the standard 
normal distribution while the broken line depicts the Cauchy distribution. 
Graphically, these distributions appear to be similar. However, to develop the 
boundedness we need to evaluate 


$$ \begin{aligned}
\int_0^x z f(z)\,dz &= \int_0^x z \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz \\
\int_0^x z g(z)\,dz &= \int_0^x z \frac{1}{\pi \left(1 + z^2\right)}\,dz.
\end{aligned} \tag{4.9} $$

Figure 4.3 presents the value of the variable times each density function. This
comparison shows the mathematics behind the problem with boundedness:
the value for the normal distribution (depicted as the solid line) converges to
zero much more rapidly than the value for the Cauchy distribution (depicted
with the broken line). The integral for each of the distributions (i.e., the
left-hand sides of Equation 4.9) is presented in Figure 4.4. Confirming the
intuition from Figure 4.3, the integral for the normal distribution is bounded
(reaching a maximum around 0.40) while the integral for the Cauchy distribution
is unbounded (it keeps increasing throughout its range). Hence, the
expectation for the normal exists, but the expectation of the Cauchy
distribution does not exist (i.e., $-\infty + \infty$ does not exist).


FIGURE 4.3
Function for Integration. (Plot not reproduced; it shows x f(x) for the normal density and x g(x) for the Cauchy density, for x from 0.0 to 4.5.)


FIGURE 4.4
Integrals of the Normal and Cauchy Expectations. (Plot not reproduced; it shows the integrals of z f(z) and z g(z) from 0 to x, for x from 0.0 to 4.5.)
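The two integrals in Equation 4.9 have closed forms, which makes the boundedness argument easy to check. The Python sketch below is an illustration, not the author's computation. It uses the antiderivatives $(1 - e^{-x^2/2})/\sqrt{2\pi}$ for the normal case and $\ln(1 + x^2)/(2\pi)$ for the Cauchy case, and compares them with a midpoint-rule approximation. All function names are hypothetical.

```python
import math


def normal_partial(x):
    """Closed form of the integral of z*f(z) over [0, x], f the standard normal density."""
    return (1.0 - math.exp(-0.5 * x * x)) / math.sqrt(2.0 * math.pi)


def cauchy_partial(x):
    """Closed form of the integral of z*g(z) over [0, x], g the Cauchy density."""
    return math.log(1.0 + x * x) / (2.0 * math.pi)


def midpoint_integral(func, upper, steps=20000):
    """Midpoint-rule approximation of the integral of func over [0, upper]."""
    width = upper / steps
    return width * sum(func((k + 0.5) * width) for k in range(steps))


def z_times_normal(z):
    return z * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def z_times_cauchy(z):
    return z / (math.pi * (1.0 + z * z))


# Numerical check of the closed forms at the right edge of Figure 4.4.
print(midpoint_integral(z_times_normal, 4.5), normal_partial(4.5))
print(midpoint_integral(z_times_cauchy, 4.5), cauchy_partial(4.5))

# The normal integral levels off near 1/sqrt(2*pi); the Cauchy integral keeps growing.
for x in (4.5, 100.0, 1.0e6):
    print(x, normal_partial(x), cauchy_partial(x))
```

The Cauchy integral grows like a logarithm, so it increases slowly but without bound, while the normal integral is capped at $1/\sqrt{2\pi}$, roughly 0.40.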

Moving to expectations of bivariate random variables, we start by defining 
the expectation of a bivariate function defined on a discrete random variable 
in Theorem 4.8. 


Theorem 4.8. Let (X,Y) be a bivariate discrete random variable taking value
$(x_i, y_j)$ with probability $P(x_i, y_j)$, $i, j = 1, 2, \ldots$, and let $\phi(x_i, y_j)$ be an
arbitrary function. Then

$$ E[\phi(X,Y)] = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \phi(x_i, y_j) P(x_i, y_j). \tag{4.10} $$

Implicit in Theorem 4.8 are boundedness conditions similar to those discussed
in Definition 4.3. Similarly, we can define the expectation of a bivariate
continuous random variable as follows.

Theorem 4.9. Let (X,Y) be a bivariate continuous random variable with
joint density function $f(x,y)$, and let $\phi(x,y)$ be an arbitrary function. Then

$$ E[\phi(X,Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \phi(x,y) f(x,y)\,dx\,dy. \tag{4.11} $$

Next, consider a couple of special cases. First, consider the scenario where
the bivariate function is a constant ($\phi(x,y) = \alpha$).

Theorem 4.10. If $\phi(x,y) = \alpha$ is a constant, then $E[\alpha] = \alpha$.

Next, we consider the expectation of a linear function of the two random
variables ($\phi(x,y) = \alpha x + \beta y$).

Theorem 4.11. If X and Y are random variables and $\alpha$ and $\beta$ are constants,

$$ E[\alpha X + \beta Y] = \alpha E[X] + \beta E[Y]. \tag{4.12} $$

Finally, we consider the expectation of the product of two independent random
variables ($\phi(x,y) = xy$ where x and y are independent).

Theorem 4.12. If X and Y are independent random variables, then
E[XY] = E[X]E[Y].


The last series of theorems is important to simplify decision making under
risk. In the crop example we have

$$ \pi = pX - C \tag{4.13} $$

where $\pi$ is profit, p is the price of the output, X is the yield level, and C is
the cost per acre. The distribution of profit along with its expected value is
dependent on the distribution of p, X, and C. In the example above, we assume
that p and C are constant at $\bar{p}$ and $\bar{C}$. The expected value of profit is then

$$ E[\pi = \phi(p, X, C)] = E[\bar{p}X - \bar{C}] = \bar{p}E[X] - \bar{C}. \tag{4.14} $$

As a first step, assume that cost is a random variable; then

$$ E[\pi = \phi(p, X, C)] = E[\bar{p}X - C] = \bar{p}E[X] - E[C]. \tag{4.15} $$

Next, assume that price and yield are random, but cost is constant:

$$ E[\pi = \phi(p, X, C)] = E[pX] - \bar{C} = \int \int p x f(x, p)\,dx\,dp - \bar{C}. \tag{4.16} $$

By assuming that p and X are independent (e.g., the firm-level assumptions),

$$ E[\pi = \phi(p, X, C)] = E[p]E[X] - \bar{C}. \tag{4.17} $$
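A small discrete example shows how Equations 4.16 and 4.17 line up. The Python sketch below is illustrative only. The yield distribution, the price distribution, and the cost are hypothetical values chosen for the example, and independence between price and yield is imposed by multiplying the two marginal probabilities.

```python
from fractions import Fraction
from itertools import product

# ASSUMPTION: hypothetical distributions and cost, chosen only for illustration.
yield_dist = {30: Fraction(1, 4), 40: Fraction(1, 2), 50: Fraction(1, 4)}
price_dist = {2: Fraction(1, 2), 4: Fraction(1, 2)}
cost = 40  # constant cost per acre


def mean(dist):
    """Expected value of a discrete distribution given as {value: probability}."""
    return sum(value * prob for value, prob in dist.items())


# Discrete analogue of Equation 4.16: sum p * x * P(x, p) over the joint
# distribution, where independence gives P(x, p) = P(x) * P(p).
expected_revenue = sum(
    p * x * prob_x * prob_p
    for (x, prob_x), (p, prob_p) in product(yield_dist.items(), price_dist.items())
)
profit_from_joint = expected_revenue - cost

# Equation 4.17: E[p] E[X] - C.
profit_from_means = mean(price_dist) * mean(yield_dist) - cost

print(profit_from_joint, profit_from_means)
```

Under independence the double sum and the product of the two means give the same expected profit. With dependent price and yield, only the double sum would remain valid.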


4.2 Moments 


Another frequently used function of random variables is the moments of the
distribution function

$$ \mu_r(X) = E[X^r] = \int_{-\infty}^{\infty} x^r f(x)\,dx \tag{4.18} $$

where r is a non-negative integer. From this definition, it is obvious that the
mean is the first moment of the distribution function. The second moment is
defined as

$$ \mu_2(X) = E[X^2] = \int_{-\infty}^{\infty} x^2 f(x)\,dx. \tag{4.19} $$

The higher moments can similarly be represented as moments around the
mean, or central moments:

$$ \tilde{\mu}_r(X) = E\left[(X - E[X])^r\right]. \tag{4.20} $$


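The first, second, third, and fourth moments of the uniform distribution
on the unit interval can be derived as

$$ \begin{aligned}
\mu_1(X) &= \int_0^1 x\,dx = \left. \tfrac{1}{2} x^2 \right|_0^1 = \tfrac{1}{2} \\
\mu_2(X) &= \int_0^1 x^2\,dx = \left. \tfrac{1}{3} x^3 \right|_0^1 = \tfrac{1}{3} \\
\mu_3(X) &= \int_0^1 x^3\,dx = \left. \tfrac{1}{4} x^4 \right|_0^1 = \tfrac{1}{4} \\
\mu_4(X) &= \int_0^1 x^4\,dx = \left. \tfrac{1}{5} x^5 \right|_0^1 = \tfrac{1}{5}.
\end{aligned} \tag{4.21} $$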


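The variance is then the second central moment.

Definition 4.13. The second central moment of the distribution defines the
variance of the distribution

$$ V(X) = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2. \tag{4.22} $$

The last equality is derived by

$$ \begin{aligned}
E\left[(X - E[X])^2\right] &= E[(X - E[X])(X - E[X])] \\
&= E\left[X^2 - 2XE[X] + (E[X])^2\right] \\
&= E[X^2] - 2E[X]E[X] + (E[X])^2 \\
&= E[X^2] - (E[X])^2.
\end{aligned} \tag{4.23} $$

Put another way, the variance can be derived as

$$ V(X) = \sigma^2 = \mu_2 - (\mu_1)^2 = \tilde{\mu}_2. \tag{4.24} $$

From these definitions, we see that for the uniform distribution

$$ V(X) = \mu_2 - (\mu_1)^2 = \frac{1}{3} - \left(\frac{1}{2}\right)^2 = \frac{1}{12}. \tag{4.25} $$

This can be verified directly by

$$ V(X) = \int_0^1 \left(x - \tfrac{1}{2}\right)^2 dx = \int_0^1 \left(x^2 - x + \tfrac{1}{4}\right) dx = \frac{1}{3} - \frac{1}{2} + \frac{1}{4} = \frac{1}{12}. \tag{4.26} $$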
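The moment and variance calculations for the uniform distribution can be reproduced in exact arithmetic. The Python sketch below is an illustration that assumes the uniform distribution is defined on [0, 1], so that the r-th moment is 1/(r + 1). The helper names are hypothetical. Central moments are obtained by expanding $(X - \mu_1)^r$ with the binomial theorem.

```python
from fractions import Fraction
from math import comb


def raw_moment(r):
    """r-th moment of the uniform distribution on [0, 1] (ASSUMED support)."""
    return Fraction(1, r + 1)


def central_moment(r):
    """r-th central moment via the binomial expansion of (X - mu_1)**r."""
    mu1 = raw_moment(1)
    return sum(comb(r, k) * raw_moment(k) * (-mu1) ** (r - k)
               for k in range(r + 1))


print([raw_moment(r) for r in range(1, 5)])
# Variance two ways: mu_2 - mu_1**2 and the second central moment.
print(raw_moment(2) - raw_moment(1) ** 2, central_moment(2))
```

Both routes to the variance give 1/12, and the third central moment is zero because the distribution is symmetric about its mean.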


4.3. Covariance and Correlation 


The most frequently used moments for bivariate and multivariate random 
variables are their means (again the first moment of each random variable) 
and covariances (which are the second own moments and cross moments with 
the other random variables). Consider the covariance between two random 


variables X and Y presented in Definition 4.14.

Definition 4.14. The covariance between two random variables X and Y
can be defined as

$$ \begin{aligned}
Cov(X,Y) &= E[(X - E[X])(Y - E[Y])] \\
&= E[XY - XE[Y] - E[X]Y + E[X]E[Y]] \\
&= E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y] \\
&= E[XY] - E[X]E[Y].
\end{aligned} \tag{4.27} $$

Note that this is simply a generalization of the standard variance formulation.
Specifically, letting $Y \to X$ yields

$$ \begin{aligned}
Cov(X,X) &= E[XX] - E[X]E[X] \\
&= E[X^2] - (E[X])^2.
\end{aligned} \tag{4.28} $$


Over the next couple of chapters we will start developing both theoretical 
and empirical statistics. Typically, theoretical statistics assume that we know 
the parameters of a known distribution. For example, assume that we know 
that a pair of random variables are distributed bivariate normal and that 
we know the parameters of that distribution. However, when we compute 
empirical statistics we assume that we do not know the underlying distribution 
or parameters. The difference is the weighting function. When we assume 
that we know the distribution and the parameters of the distribution, we will 
use the known probability density function to compute statistics such as the
variance and covariance coefficients. In the case of empirical statistics, we 
typically assume that observations are equally likely (i.e., typically weighting 
by 1/N where there are N observations). 

These assumptions have implications for our specification of the
variance/covariance matrix. For example, assume that we are interested in the
theoretical variance/covariance matrix. Assume that the joint distribution
function can be written as $f(x, y | \theta)$ where $\theta$ is a known set of parameters.
Using a slight extension of the moment specification, we can compute the first
moment of both x and y as

$$ \begin{aligned}
\mu_x(\theta) &= \int_{y_a}^{y_b} \int_{x_a}^{x_b} x f(x, y | \theta)\,dx\,dy \\
\mu_y(\theta) &= \int_{y_a}^{y_b} \int_{x_a}^{x_b} y f(x, y | \theta)\,dx\,dy
\end{aligned} \tag{4.29} $$

where $x \in [x_a, x_b]$ and $y \in [y_a, y_b]$. Note that we are dropping the subscript
1 to denote the first moment and use the subscript to denote the variable
($\mu_x(\theta) \Leftarrow \mu_1(\theta)$). Given these means (or first moments), we can then define
the variance for each variable as

$$ \begin{aligned}
\sigma_x^2(\theta) = \sigma_{xx}(\theta) &= \int_{y_a}^{y_b} \int_{x_a}^{x_b} (x - \mu_x(\theta))^2 f(x, y | \theta)\,dx\,dy \\
\sigma_y^2(\theta) = \sigma_{yy}(\theta) &= \int_{y_a}^{y_b} \int_{x_a}^{x_b} (y - \mu_y(\theta))^2 f(x, y | \theta)\,dx\,dy.
\end{aligned} \tag{4.30} $$

The covariance coefficient can then be expressed as

$$ \sigma_{xy}(\theta) = \int_{y_a}^{y_b} \int_{x_a}^{x_b} (x - \mu_x(\theta))(y - \mu_y(\theta)) f(x, y | \theta)\,dx\,dy. \tag{4.31} $$

Notice that the covariance function is symmetric (i.e., $\sigma_{xy}(\theta) = \sigma_{yx}(\theta)$).
The coefficients from Equations 4.30 and 4.31 can be used to populate the
variance/covariance matrix conditional on $\theta$ as

$$ \Sigma(\theta) = \begin{bmatrix} \sigma_{xx}(\theta) & \sigma_{xy}(\theta) \\ \sigma_{xy}(\theta) & \sigma_{yy}(\theta) \end{bmatrix}. \tag{4.32} $$

Example 4.15 presents a numerical example of the covariance matrix using a 
discrete (theoretical) distribution function. 


Example 4.15. Assume that the probability for a bivariate discrete random
variable is presented in Table 4.5. We start by computing the means

$$ \begin{aligned}
\mu_x &= (0.167 + 0.083 + 0.167) \times 1 + (0.083 + 0.000 + 0.083) \times 0 + (0.167 + 0.083 + 0.167) \times (-1) \\
&= 0.417 \times 1 + 0.167 \times 0 + 0.417 \times (-1) = 0 \\
\mu_y &= (0.167 + 0.083 + 0.167) \times 1 + (0.083 + 0.000 + 0.083) \times 0 + (0.167 + 0.083 + 0.167) \times (-1) \\
&= 0.417 \times 1 + 0.167 \times 0 + 0.417 \times (-1) = 0.
\end{aligned} \tag{4.33} $$


TABLE 4.5
Discrete Sample

                        Y               Marginal
X              -1       0       1    Probability
-1          0.167   0.083   0.167          0.417
 0          0.083   0.000   0.083          0.167
 1          0.167   0.083   0.167          0.417

Marginal
Probability 0.417   0.167   0.417


Given that the means for both variables are zero, we can compute the variance
and covariance as

$$ \begin{aligned}
V[X] &= \sum_{i \in \{-1,0,1\}} \sum_{j \in \{-1,0,1\}} x_i^2 P(x_i, y_j) = (-1)^2 \times 0.167 + (0)^2 \times 0.083 + \cdots + (1)^2 \times 0.167 = 0.834 \\
V[Y] &= \sum_{i \in \{-1,0,1\}} \sum_{j \in \{-1,0,1\}} y_j^2 P(x_i, y_j) = (-1)^2 \times 0.167 + (0)^2 \times 0.083 + \cdots + (1)^2 \times 0.167 = 0.834 \\
Cov[X,Y] &= \sum_{i \in \{-1,0,1\}} \sum_{j \in \{-1,0,1\}} x_i y_j P(x_i, y_j) = (-1) \times (-1) \times 0.167 + (-1) \times 0 \times 0.083 + \cdots + 1 \times 1 \times 0.167 = 0.
\end{aligned} \tag{4.34} $$

Thus, the variance matrix becomes

$$ \Sigma = \begin{bmatrix} 0.834 & 0.000 \\ 0.000 & 0.834 \end{bmatrix}. \tag{4.35} $$


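The result in Equation 4.35 illustrates a property related to independence:
independent random variables have a covariance of zero. The converse does not
hold in general. In Table 4.5 the covariance is zero even though
P(X = 0, Y = 0) = 0.000 differs from the product of the marginal
probabilities, 0.167 x 0.167, so X and Y are uncorrelated but not independent.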
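The calculations in Example 4.15 can be reproduced directly from the joint probabilities. The Python sketch below is an illustration that copies the rounded values from Table 4.5. The helper `expect` is a hypothetical name. The last lines compare P(X = 0, Y = 0) with the product of the corresponding marginal probabilities.

```python
# Joint probabilities P(x, y) copied from Table 4.5 (rounded to three decimals).
joint = {
    (-1, -1): 0.167, (-1, 0): 0.083, (-1, 1): 0.167,
    (0, -1): 0.083, (0, 0): 0.000, (0, 1): 0.083,
    (1, -1): 0.167, (1, 0): 0.083, (1, 1): 0.167,
}


def expect(func):
    """Expected value of func(x, y) under the joint distribution."""
    return sum(func(x, y) * p for (x, y), p in joint.items())


mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mean_x) ** 2)
var_y = expect(lambda x, y: (y - mean_y) ** 2)
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y))

# Marginal probabilities of X = 0 and Y = 0, summed from the joint table.
marginal_x0 = sum(p for (x, y), p in joint.items() if x == 0)
marginal_y0 = sum(p for (x, y), p in joint.items() if y == 0)

print(var_x, var_y, cov_xy)
print(joint[(0, 0)], marginal_x0 * marginal_y0)
```

The covariance is zero up to floating-point rounding, while the joint probability of (0, 0) is clearly different from the product of the marginals.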
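From a sample perspective, we can compute the variance and covariance as

$$ \begin{aligned}
s_{xx} &= \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \bar{x}^2 \\
s_{yy} &= \frac{1}{N} \sum_{i=1}^{N} y_i^2 - \bar{y}^2 \\
s_{xy} &= \frac{1}{N} \sum_{i=1}^{N} x_i y_i - \bar{x}\bar{y}
\end{aligned} \tag{4.36} $$

where $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$ and $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$. Typically, a lower case Roman
character is used to denote individual sample statistics. Substituting the
sample measures into the variance matrix yields

$$ \begin{aligned}
S = \begin{bmatrix} s_{xx} & s_{xy} \\ s_{yx} & s_{yy} \end{bmatrix}
&= \begin{bmatrix}
\frac{1}{N} \sum_{i=1}^{N} x_i x_i - \bar{x}\bar{x} & \frac{1}{N} \sum_{i=1}^{N} x_i y_i - \bar{x}\bar{y} \\
\frac{1}{N} \sum_{i=1}^{N} y_i x_i - \bar{y}\bar{x} & \frac{1}{N} \sum_{i=1}^{N} y_i y_i - \bar{y}\bar{y}
\end{bmatrix} \\
&= \frac{1}{N} \begin{bmatrix}
\sum_{i=1}^{N} x_i x_i & \sum_{i=1}^{N} x_i y_i \\
\sum_{i=1}^{N} y_i x_i & \sum_{i=1}^{N} y_i y_i
\end{bmatrix}
- \begin{bmatrix} \bar{x}\bar{x} & \bar{x}\bar{y} \\ \bar{y}\bar{x} & \bar{y}\bar{y} \end{bmatrix}
\end{aligned} \tag{4.37} $$

where the upper case Roman letter is used to denote a matrix of sample
statistics. While we will develop the matrix notation more completely in
Chapter 10, the sample covariance matrix can then be written as

$$ S = \frac{1}{N}
\begin{bmatrix} x_1 & \cdots & x_N \\ y_1 & \cdots & y_N \end{bmatrix}
\begin{bmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_N & y_N \end{bmatrix}
- \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix}
\begin{bmatrix} \bar{x} & \bar{y} \end{bmatrix}. \tag{4.38} $$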
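The sample formulas can be applied to the yield and profit observations in Table 4.4. The Python sketch below is an illustration, with the hypothetical helper `sample_cov` implementing the 1/N weighting of Equation 4.36 (not the N - 1 divisor used by many statistical packages).

```python
# Yield and profit observations copied from Table 4.4.
yields = [40, 40, 40, 50, 50, 45, 35, 25, 40, 50,
          30, 35, 40, 25, 45, 35, 35, 40, 30, 45]
profits = [80, 80, 80, 110, 110, 95, 65, 35, 80, 110,
           50, 65, 80, 35, 95, 65, 65, 80, 50, 95]


def mean(values):
    return sum(values) / len(values)


def sample_cov(u, v):
    """Equation 4.36: (1/N) * sum(u_i * v_i) - mean(u) * mean(v)."""
    return mean([a * b for a, b in zip(u, v)]) - mean(u) * mean(v)


# Equation 4.37: the 2 x 2 matrix of sample variances and covariances.
S = [[sample_cov(yields, yields), sample_cov(yields, profits)],
     [sample_cov(profits, yields), sample_cov(profits, profits)]]

for row in S:
    print(row)
```

Because the profits in Table 4.4 are an exact linear function of the yields, the covariance is a fixed multiple of the yield variance and the matrix S is singular.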


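Next, we need to develop a couple of theorems regarding the variance
of linear combinations of random variables. First consider the linear sum or
difference of two random variables X and Y.

Theorem 4.16. $V(X \pm Y) = V(X) + V(Y) \pm 2\,Cov(X,Y)$.

Proof. Measuring X and Y as deviations from their means,

$$ \begin{aligned}
V[X \pm Y] &= E[(X \pm Y)(X \pm Y)] \\
&= E[XX \pm 2XY + YY] \\
&= E[XX] + E[YY] \pm 2E[XY] \\
&= V(X) + V(Y) \pm 2\,Cov(X,Y).
\end{aligned} \tag{4.39} $$

Note that this result can be obtained from the variance matrix. Specifically,
X + Y can be written as a vector operation:

$$ \begin{bmatrix} X & Y \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = X + Y. \tag{4.40} $$

Given this vectorization of the problem, we can define the variance of the sum
as

$$ \begin{aligned}
V\left( \begin{bmatrix} X & Y \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right)
&= \begin{bmatrix} 1 & 1 \end{bmatrix}
\begin{bmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{xy} & \sigma_{yy} \end{bmatrix}
\begin{bmatrix} 1 \\ 1 \end{bmatrix}
= \begin{bmatrix} \sigma_{xx} + \sigma_{xy} & \sigma_{xy} + \sigma_{yy} \end{bmatrix}
\begin{bmatrix} 1 \\ 1 \end{bmatrix} \\
&= \sigma_{xx} + 2\sigma_{xy} + \sigma_{yy}.
\end{aligned} \tag{4.41} $$

Next, consider the variance for the sum of a collection of random variables $X_i$
where $i = 1, \ldots, N$.

Theorem 4.17. Let $X_i$, $i = 1, 2, \ldots, N$ be pairwise independent (i.e., where
$\sigma_{ij} = 0$ for $i \neq j$, $i, j = 1, 2, \ldots, N$). Then

$$ V\left( \sum_{i=1}^{N} X_i \right) = \sum_{i=1}^{N} V(X_i). \tag{4.42} $$

Proof. The simplest proof to this theorem is to use the variance matrix. Note
in the preceding example, if X and Y are independent, we have

$$ \begin{bmatrix} 1 & 1 \end{bmatrix}
\begin{bmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{xy} & \sigma_{yy} \end{bmatrix}
\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \sigma_{xx} + \sigma_{yy} \tag{4.43} $$

if $\sigma_{xy} = 0$. Extending this result to three variables implies

$$ \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{bmatrix}
\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
= \sigma_{11} + 2\sigma_{12} + 2\sigma_{13} + \sigma_{22} + 2\sigma_{23} + \sigma_{33}; \tag{4.44} $$

if the xs are independent, the covariance terms are zero and this expression
simply becomes the sum of the variances.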
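The quadratic form used in Equations 4.41 through 4.44 is simple to compute. The Python sketch below is an illustration with two hypothetical variance matrices: one with nonzero covariances and one diagonal. The function name `variance_of_sum` is also hypothetical.

```python
def variance_of_sum(sigma):
    """Quadratic form 1' Sigma 1: the variance of the sum of the variables."""
    ones = [1.0] * len(sigma)
    # First form Sigma * 1, then premultiply by the row vector of ones.
    sigma_times_ones = [sum(s_ij * o for s_ij, o in zip(row, ones)) for row in sigma]
    return sum(o * v for o, v in zip(ones, sigma_times_ones))


# ASSUMPTION: hypothetical variance matrices chosen only for illustration.
sigma_correlated = [[4.0, 1.0, 0.5],
                    [1.0, 9.0, -2.0],
                    [0.5, -2.0, 1.0]]
sigma_uncorrelated = [[4.0, 0.0, 0.0],
                      [0.0, 9.0, 0.0],
                      [0.0, 0.0, 1.0]]

print(variance_of_sum(sigma_correlated))    # variances plus twice each covariance
print(variance_of_sum(sigma_uncorrelated))  # sum of the variances only
```

With the diagonal matrix the result is just the sum of the diagonal elements, which is the content of Theorem 4.17.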


One of the difficulties with the covariance coefficient for making intuitive 
judgments about the strength of the relationship between two variables is that 
it is dependent on the magnitude of the variance of each variable. Hence, we 
often compute a normalized version of the covariance coefficient called the 
correlation coefficient. 


Definition 4.18. The correlation coefficient for two variables is defined as

$$ Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{V(X)\,V(Y)}}. \tag{4.45} $$


Note that the covariance between any random variable and a constant is
equal to zero. Letting Y be a constant, so that Y - E[Y] = 0, we have

$$ E[(X - E[X])(Y - E[Y])] = E[(X - E[X])(0)] = 0. \tag{4.46} $$

It stands to reason that the correlation coefficient between a random variable
and a constant is also taken to be zero, although Equation 4.45 is formally
undefined in that case because V(Y) = 0.

It is now possible to derive the ordinary least squares estimator for a linear
regression equation.

Definition 4.19. We define the ordinary least squares estimator as that set
of parameters that minimizes the expected squared error of the estimate.

$$ \min_{\alpha, \beta} S = E\left[(Y - \alpha - \beta X)^2\right] = \min_{\alpha, \beta} E\left[Y^2 - 2\alpha Y - 2\beta XY + \alpha^2 + 2\alpha\beta X + \beta^2 X^2\right]. \tag{4.47} $$

The first order conditions for this minimization problem then become

$$ \begin{aligned}
\frac{\partial S}{\partial \alpha} &= -2E[Y] + 2\alpha + 2\beta E[X] = 0 \\
\frac{\partial S}{\partial \beta} &= -2E[XY] + 2\alpha E[X] + 2\beta E[X^2] = 0.
\end{aligned} \tag{4.48} $$

Solving the first equation for $\alpha$ yields

$$ \alpha = E[Y] - \beta E[X]. \tag{4.49} $$

Substituting this expression into the second first order condition yields

$$ \begin{aligned}
-E[XY] + (E[Y] - \beta E[X]) E[X] + \beta E[X^2] &= 0 \\
-E[XY] + E[Y]E[X] + \beta \left(E[X^2] - (E[X])^2\right) &= 0 \\
-Cov(X,Y) + \beta V(X) &= 0 \\
\Rightarrow \beta &= \frac{Cov(X,Y)}{V(X)}.
\end{aligned} \tag{4.50} $$

Theorem 4.20. The best linear predictor (or more exactly, the minimum
mean-squared-error linear predictor) of Y based on X is given by $\alpha^* + \beta^* X$,
where $\alpha^*$ and $\beta^*$ are the least squares estimates defined
by Equations 4.49 and 4.50, respectively.
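Equations 4.49 and 4.50 can be applied to data by replacing the population expectations with sample averages. The Python sketch below is an illustration under that assumption, using the yield and profit observations from Table 4.4. The variable names are hypothetical.

```python
# Yield (X) and profit (Y) observations copied from Table 4.4.
yields = [40, 40, 40, 50, 50, 45, 35, 25, 40, 50,
          30, 35, 40, 25, 45, 35, 35, 40, 30, 45]
profits = [80, 80, 80, 110, 110, 95, 65, 35, 80, 110,
           50, 65, 80, 35, 95, 65, 65, 80, 50, 95]


def mean(values):
    return sum(values) / len(values)


# ASSUMPTION: sample averages stand in for the expectations in the text.
cov_xy = mean([x * y for x, y in zip(yields, profits)]) - mean(yields) * mean(profits)
var_x = mean([x * x for x in yields]) - mean(yields) ** 2

beta = cov_xy / var_x                         # Equation 4.50
alpha = mean(profits) - beta * mean(yields)   # Equation 4.49

print(alpha, beta)
```

Since the profits in Table 4.4 lie exactly on a line in yield, the fitted line reproduces that relationship with no residual error.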


4.4 Conditional Mean and Variance 


Next, we consider the formulation where we are given some information about 
the bivariate random variable and wish to compute the implications of this 
knowledge for the other random variable. Specifically, assume that we are 
given the value of X and want to compute the expectation of ¢(X,Y) given 
that information. For the case of the discrete random variable, we can define 
this expectation, called the conditional mean, using Definition 4.21. 


Definition 4.21. Let (X,Y) be a bivariate discrete random variable taking
values $(x_i, y_j)$, $i, j = 1, 2, \ldots$. Let $P(y_j | X)$ be the conditional probability of
$Y = y_j$ given X. Let $\phi(x_i, y_j)$ be an arbitrary function. Then the conditional
mean of $\phi(X,Y)$ given X, denoted $E[\phi(X,Y) | X]$ or $E_{Y|X}[\phi(X,Y)]$, is
defined by

$$ E_{Y|X}[\phi(X,Y)] = \sum_{j=1}^{\infty} \phi(X, y_j) P(y_j | X). \tag{4.51} $$

Similarly, Definition 4.22 presents the conditional mean for the continuous
random variable.

Definition 4.22. Let (X,Y) be a bivariate continuous random variable with
conditional density $f(y | x)$. Let $\phi(x,y)$ be an arbitrary function. Then the
conditional mean of $\phi(X,Y)$ given X is defined by

$$ E_{Y|X}[\phi(X,Y)] = \int_{-\infty}^{\infty} \phi(X, y) f(y | X)\,dy. \tag{4.52} $$

Building on these definitions, we can demonstrate the Law of Iterated
Means given in Theorem 4.23.

Theorem 4.23 (Law of Iterated Means). $E[\phi(X,Y)] = E_X E_{Y|X}[\phi(X,Y)]$
(where the symbol $E_X$ denotes the expectation with respect to X).

Proof. Consider the general expectation of $\phi(X,Y)$ assuming a continuous
random variable,

$$ E[\phi(X,Y)] = \int_{x_a}^{x_b} \int_{y_a}^{y_b} \phi(x,y) f(x,y)\,dy\,dx \tag{4.53} $$

as developed in Theorem 4.9. Next we can group without changing the result,
yielding

$$ E[\phi(X,Y)] = \int_{x_a}^{x_b} \left[ \int_{y_a}^{y_b} \phi(x,y) f(y | x)\,dy \right] f(x)\,dx. \tag{4.54} $$

Notice that the term in brackets is the $E_{Y|X}[\phi(X,Y)]$ by Definition 4.22. To
complete the proof, we rewrite Equation 4.54 slightly,

$$ E[\phi(X,Y)] = \int_{x_a}^{x_b} E_{Y|X}[\phi(X,Y)] f(x)\,dx \tag{4.55} $$

basically rewriting $f(x,y) = f(y | x) f(x)$.

Building on the conditional means, the conditional variance for $\phi(X,Y)$
is presented in Theorem 4.24.

Theorem 4.24.

$$ V(\phi(X,Y)) = E_X\left[V_{Y|X}[\phi(X,Y)]\right] + V_X\left[E_{Y|X}[\phi(X,Y)]\right]. \tag{4.56} $$

Proof.

$$ V_{Y|X}[\phi] = E_{Y|X}[\phi^2] - \left(E_{Y|X}[\phi]\right)^2 \tag{4.57} $$

implies

$$ E_X\left[V_{Y|X}(\phi)\right] = E[\phi^2] - E_X\left[\left(E_{Y|X}[\phi]\right)^2\right]. \tag{4.58} $$

By the definition of variance,

$$ V_X\left(E_{Y|X}[\phi]\right) = E_X\left[\left(E_{Y|X}[\phi]\right)^2\right] - (E[\phi])^2. \tag{4.59} $$

Adding these expressions yields

$$ E_X\left[V_{Y|X}(\phi)\right] + V_X\left(E_{Y|X}[\phi]\right) = E[\phi^2] - (E[\phi])^2 = V(\phi). \tag{4.60} $$
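Theorems 4.23 and 4.24 can be verified on a small discrete distribution. The Python sketch below is an illustration. The joint probabilities and the function `phi` are hypothetical choices, and exact fractions are used so that the two sides of each identity can be compared without rounding.

```python
from fractions import Fraction as F

# ASSUMPTION: hypothetical joint distribution P(x, y), chosen only for illustration.
joint = {
    (0, 0): F(1, 8), (0, 1): F(1, 4), (0, 2): F(1, 8),
    (1, 0): F(1, 4), (1, 1): F(1, 8), (1, 2): F(1, 8),
}


def phi(x, y):
    return x + 2 * y  # an arbitrary function of (X, Y)


x_values = sorted({x for x, _ in joint})
marginal_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in x_values}


def cond_mean(func, x):
    """E[func(X, Y) | X = x], using P(y | x) = P(x, y) / P(x)."""
    return sum(func(xi, y) * p / marginal_x[x]
               for (xi, y), p in joint.items() if xi == x)


def cond_var(x):
    """V[phi | X = x] = E[phi^2 | X = x] - (E[phi | X = x])^2."""
    return cond_mean(lambda a, b: phi(a, b) ** 2, x) - cond_mean(phi, x) ** 2


# Unconditional mean and variance of phi(X, Y).
total_mean = sum(phi(x, y) * p for (x, y), p in joint.items())
total_var = sum(phi(x, y) ** 2 * p for (x, y), p in joint.items()) - total_mean ** 2

# Law of Iterated Means and the variance decomposition.
iterated_mean = sum(cond_mean(phi, x) * marginal_x[x] for x in x_values)
mean_of_cond_var = sum(cond_var(x) * marginal_x[x] for x in x_values)
var_of_cond_mean = (sum(cond_mean(phi, x) ** 2 * marginal_x[x] for x in x_values)
                    - iterated_mean ** 2)

print(total_mean, iterated_mean)
print(total_var, mean_of_cond_var + var_of_cond_mean)
```

Each printed pair should agree exactly, since both identities hold for any joint distribution with finite second moments.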


Finally, we can link the least squares estimator to the projected variance 
in Theorem 4.20 using Theorem 4.25. 


Theorem 4.25. The best predictor (or the minimum mean-squared-error
predictor) of Y based on X is given by E[Y|X].


4.5 Moment-Generating Functions


The preceding sections of this chapter have approached the moments of ran- 
dom variables one at a time. One alternative to this approach is to define a 
function called a Moment Generating Function that systematically gen- 
erates the moments for random variables. 


Definition 4.26. Let X be a random variable with a cumulative distribution
function F(X). The moment generating function of X (or F(X)), denoted
$M_X(t)$, is

$$ M_X(t) = E\left[e^{tX}\right] \tag{4.61} $$

provided that the expectation exists for t in some neighborhood of 0. That is,
there is an h > 0 such that, for all t in -h < t < h, $E\left[e^{tX}\right]$ exists.


If the expectation does not exist in a neighborhood of 0, we say that the 
moment generating function does not exist. More explicitly, the moment gen- 
erating function can be defined as 


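$$ \begin{aligned}
M_X(t) &= \int_{-\infty}^{\infty} e^{tx} f(x)\,dx \quad \text{for continuous random variables, and} \\
M_X(t) &= \sum_{x} e^{tx} P[X = x] \quad \text{for discrete random variables.}
\end{aligned} \tag{4.62} $$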


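Theorem 4.27. If X has a moment generating function $M_X(t)$, then

$$ E[X^n] = M_X^{(n)}(0) \tag{4.63} $$

where we define

$$ M_X^{(n)}(0) = \left. \frac{d^n}{dt^n} M_X(t) \right|_{t=0}. \tag{4.64} $$

First note that $e^{tx}$ can be approximated around zero using a Taylor series
expansion.

$$ \begin{aligned}
M_X(t) = E\left[e^{tx}\right] &= E\left[e^0 + t e^0 (x - 0) + \frac{1}{2} t^2 e^0 (x - 0)^2 + \frac{1}{6} t^3 e^0 (x - 0)^3 + \cdots\right] \\
&= 1 + E[x]\,t + E[x^2]\,\frac{t^2}{2} + E[x^3]\,\frac{t^3}{6} + \cdots.
\end{aligned} \tag{4.65} $$

Note for any moment n

$$ M_X^{(n)}(t) = \frac{d^n}{dt^n} M_X(t) = E[x^n] + E[x^{n+1}]\,t + \frac{1}{2} E[x^{n+2}]\,t^2 + \cdots. \tag{4.66} $$

Thus, as $t \to 0$,

$$ M_X^{(n)}(0) = E[x^n]. \tag{4.67} $$

Definition 4.28. Leibnitz's Rule: If $f(x, \theta)$, $a(\theta)$, and $b(\theta)$ are
differentiable with respect to $\theta$, then

$$ \frac{d}{d\theta} \int_{a(\theta)}^{b(\theta)} f(x, \theta)\,dx = f(b(\theta), \theta) \frac{d}{d\theta} b(\theta) - f(a(\theta), \theta) \frac{d}{d\theta} a(\theta) + \int_{a(\theta)}^{b(\theta)} \frac{\partial}{\partial\theta} f(x, \theta)\,dx. \tag{4.68} $$

Lemma 4.29. Casella and Berger's proof: Assume that we can differentiate
under the integral using Leibnitz's rule; we have

$$ \begin{aligned}
\frac{d}{dt} M_X(t) &= \frac{d}{dt} \int_{-\infty}^{\infty} e^{tx} f(x)\,dx \\
&= \int_{-\infty}^{\infty} \left( \frac{d}{dt} e^{tx} \right) f(x)\,dx \\
&= \int_{-\infty}^{\infty} x e^{tx} f(x)\,dx.
\end{aligned} \tag{4.69} $$

Letting $t \to 0$, this integral simply becomes

$$ \int_{-\infty}^{\infty} x f(x)\,dx = E[x]. \tag{4.70} $$

This proof can be extended for any moment of the distribution function.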


4.5.1 Moment-Generating Functions for Specific 
Distributions 


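The moment generating function for the uniform distribution is

$$ M_X(t) = \int_a^b \frac{e^{tx}}{b - a}\,dx = \frac{1}{b - a} \left. \frac{1}{t} e^{tx} \right|_a^b = \frac{e^{bt} - e^{at}}{t (b - a)}. \tag{4.71} $$

Following the expansion developed earlier, we have

$$ \begin{aligned}
M_X(t) &= \frac{(1 - 1) + (b - a) t + \frac{1}{2} (b^2 - a^2) t^2 + \frac{1}{6} (b^3 - a^3) t^3 + \cdots}{t (b - a)} \\
&= 1 + \frac{(b^2 - a^2) t^2}{2 (b - a) t} + \frac{(b^3 - a^3) t^3}{6 (b - a) t} + \cdots \\
&= 1 + \frac{1}{2} \frac{(b - a)(b + a) t^2}{(b - a) t} + \frac{1}{6} \frac{(b - a)(b^2 + ab + a^2) t^3}{(b - a) t} + \cdots \\
&= 1 + \frac{1}{2} (a + b) t + \frac{1}{6} (a^2 + ab + b^2) t^2 + \cdots.
\end{aligned} \tag{4.72} $$

Letting b = 1 and a = 0, the last expression becomes

$$ M_X(t) = 1 + \frac{1}{2} t + \frac{1}{6} t^2 + \frac{1}{24} t^3 + \cdots. \tag{4.73} $$

The first three moments of the uniform distribution are then

$$ \begin{aligned}
M_X^{(1)}(0) &= \frac{1}{2} \\
M_X^{(2)}(0) &= \frac{1}{6} \times 2 = \frac{1}{3} \\
M_X^{(3)}(0) &= \frac{1}{24} \times 6 = \frac{1}{4}.
\end{aligned} \tag{4.74} $$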

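For the normal distribution, the moment generating function is

$$ \begin{aligned}
M_X(t) &= \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx} \exp\left[ -\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right] dx \\
&= \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left[ tx - \frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right] dx.
\end{aligned} \tag{4.75} $$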

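Focusing on the term in the exponent, we have

$$ \begin{aligned}
tx - \frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} &= -\frac{1}{2} \frac{(x - \mu)^2 - 2tx\sigma^2}{\sigma^2} \\
&= -\frac{1}{2} \frac{x^2 - 2x\mu + \mu^2 - 2tx\sigma^2}{\sigma^2} \\
&= -\frac{1}{2} \frac{x^2 - 2\left(x\mu + tx\sigma^2\right) + \mu^2}{\sigma^2} \\
&= -\frac{1}{2} \frac{x^2 - 2x\left(\mu + t\sigma^2\right) + \mu^2}{\sigma^2}.
\end{aligned} \tag{4.76} $$

The next step is to complete the square in the numerator.

$$ \begin{aligned}
x^2 - 2x\left(\mu + t\sigma^2\right) + \mu^2 + c &= 0 \\
\left(x - \left(\mu + t\sigma^2\right)\right)^2 &= 0 \\
x^2 - 2x\left(\mu + t\sigma^2\right) + \mu^2 + 2t\sigma^2\mu + t^2\sigma^4 &= 0 \\
c &= 2t\sigma^2\mu + t^2\sigma^4.
\end{aligned} \tag{4.77} $$

The complete expression then becomes

$$ \begin{aligned}
tx - \frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} &= -\frac{1}{2} \frac{\left(x - \mu - t\sigma^2\right)^2 - 2\mu\sigma^2 t - \sigma^4 t^2}{\sigma^2} \\
&= -\frac{1}{2} \frac{\left(x - \mu - t\sigma^2\right)^2}{\sigma^2} + \mu t + \frac{1}{2} \sigma^2 t^2.
\end{aligned} \tag{4.78} $$

The moment generating function then becomes

$$ \begin{aligned}
M_X(t) &= \exp\left( \mu t + \frac{1}{2} \sigma^2 t^2 \right) \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left[ -\frac{1}{2} \frac{\left(x - \mu - t\sigma^2\right)^2}{\sigma^2} \right] dx \\
&= \exp\left( \mu t + \frac{1}{2} \sigma^2 t^2 \right).
\end{aligned} \tag{4.79} $$

Taking the first derivative with respect to t, we get

$$ M_X^{(1)}(t) = \left(\mu + \sigma^2 t\right) \exp\left( \mu t + \frac{1}{2} \sigma^2 t^2 \right). \tag{4.80} $$

Letting $t \to 0$, this becomes

$$ M_X^{(1)}(0) = \mu. \tag{4.81} $$

The second derivative of the moment generating function with respect to t
yields

$$ M_X^{(2)}(t) = \sigma^2 \exp\left( \mu t + \frac{1}{2} \sigma^2 t^2 \right) + \left(\mu + \sigma^2 t\right)\left(\mu + \sigma^2 t\right) \exp\left( \mu t + \frac{1}{2} \sigma^2 t^2 \right). \tag{4.82} $$

Again, letting $t \to 0$ yields

$$ M_X^{(2)}(0) = \sigma^2 + \mu^2. \tag{4.83} $$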
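The derivatives of the normal moment generating function at zero can be approximated numerically. The Python sketch below is an illustration. It uses central finite differences rather than the analytical derivatives in the text, and the parameter values are hypothetical.

```python
import math


def normal_mgf(t, mu, sigma):
    """Normal moment generating function: exp(mu*t + sigma^2 * t^2 / 2)."""
    return math.exp(mu * t + 0.5 * sigma ** 2 * t ** 2)


def first_derivative_at_zero(mgf, h=1e-3):
    """Central-difference approximation of M'(0)."""
    return (mgf(h) - mgf(-h)) / (2.0 * h)


def second_derivative_at_zero(mgf, h=1e-3):
    """Central-difference approximation of M''(0)."""
    return (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / (h * h)


mu, sigma = 1.5, 2.0  # ASSUMPTION: hypothetical parameter values


def mgf(t):
    return normal_mgf(t, mu, sigma)


print(first_derivative_at_zero(mgf), mu)
print(second_derivative_at_zero(mgf), sigma ** 2 + mu ** 2)
```

The finite-difference values agree with $\mu$ and $\sigma^2 + \mu^2$ up to the approximation error of the difference quotient.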


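Let X and Y be independent random variables with moment generating
functions $M_X(t)$ and $M_Y(t)$. Consider their sum Z = X + Y and its moment
generating function.

$$ M_Z(t) = E\left[e^{tZ}\right] = E\left[e^{t(X+Y)}\right] = E\left[e^{tX} e^{tY}\right] = E\left[e^{tX}\right] E\left[e^{tY}\right] = M_X(t) M_Y(t). \tag{4.84} $$

We conclude that the moment generating function for the sum of two
independent random variables is equal to the product of the moment generating
functions of each variable. Skipping ahead slightly, the multivariate normal
distribution function can be written as

$$ f(x) = \frac{1}{(2\pi)^{k/2}} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right) \tag{4.85} $$

where k is the number of elements in the vector x. In order to derive the
moment generating function, we now need a vector t. The moment generating
function can then be defined as

$$ M_X(t) = \exp\left( \mu' t + \frac{1}{2} t' \Sigma t \right). \tag{4.86} $$

Normal variables are independent if the variance matrix is a diagonal matrix.
Note that if the variance matrix is diagonal, the moment generating function
for the normal can be written as

$$ \begin{aligned}
M_X(t) &= \exp\left( \mu' t + \frac{1}{2} t' \begin{bmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_3^2 \end{bmatrix} t \right) \\
&= \exp\left( \mu_1 t_1 + \mu_2 t_2 + \mu_3 t_3 + \frac{1}{2} \left( t_1^2 \sigma_1^2 + t_2^2 \sigma_2^2 + t_3^2 \sigma_3^2 \right) \right) \\
&= \exp\left( \left( \mu_1 t_1 + \frac{1}{2} \sigma_1^2 t_1^2 \right) + \left( \mu_2 t_2 + \frac{1}{2} \sigma_2^2 t_2^2 \right) + \left( \mu_3 t_3 + \frac{1}{2} \sigma_3^2 t_3^2 \right) \right) \\
&= M_{X_1}(t_1) M_{X_2}(t_2) M_{X_3}(t_3).
\end{aligned} \tag{4.87} $$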

4.6 Chapter Summary 
• The moments of the distribution are the expected values of the kth power
  of the random variable.

  – The first moment is its expected value or the mean of the distribution.
    It is largely a measure of the central tendency of the random variable.

  – The second central moment of the distribution (i.e., the expected
    value of $(x - \mu)^2$) is a measure of the expected squared distance
    between the value of the random variable and its mean.

• The existence of an expectation is dependent on the boundedness of each
  side of the distribution.

  – Several random variables have an infinite range (e.g., the normal
    distribution's range is $x \in (-\infty, \infty)$).

  – Whether the expectation exists depends on whether the value of the
    density function converges to zero faster than the value of the
    random variable grows.

  – Some distribution functions such as the Cauchy distribution have no
    mean.

• The variance of the distribution can also be defined as the expected
  squared value of the random variable minus the expected value squared.

  – This squared value specification is useful in defining the covariance
    between two random variables (i.e., Cov(X,Y) = E[XY] - E[X]E[Y]).

  – The correlation coefficient is a normalized form of the covariance,
    $\rho_{XY} = Cov[X,Y] / \sqrt{V[X]\,V[Y]}$.

• This chapter also introduces the concepts of sample means and variances
  versus theoretical means and variances.

• Moment generating functions are functions whose derivatives, evaluated at
  zero, give the moments of a particular distribution. Two random variables
  with the same moment generating function have the same distribution.


4.7 Review Questions 


4-1R. What are the implications of autocorrelation (i.e., $y_t = \alpha_0 + \alpha_1 x_t + \epsilon_t + \rho \epsilon_{t-1}$)
for the covariance between t and t + 1?


4-2R. What does this mean for the variance of the sum of $y_t$ and $y_{t-1}$?


4-3R. Derive the normal equations (see Definition 4.19) of a regression with
two independent variables (i.e., $y_i = \alpha_0 + \alpha_1 x_{1i} + \alpha_2 x_{2i} + \epsilon_i$).

4-4R. Use the moment generating function for the standard normal distri- 
bution to demonstrate that the third central moment of the normal 
distribution is zero and the fourth central moment of the standard nor- 
mal is 3.0. 


4.8 Numerical Exercises 


4-1E. What is the expected value of a two-die roll of a standard six-sided die?

4-2E. What is the expected value of the roll of an eight-sided die and a
six-sided die?


4-3E. Compute the expectation and variance of a random variable with the
distribution function

$$ f(x) = -\frac{3}{4} \left( x^2 - 1 \right), \quad x \in (-1, 1). \tag{4.88} $$

4-4E. What is the expected value of the negative exponential distribution

$$ f(x | \lambda) = \lambda \exp(-\lambda x), \quad x \in (0, \infty)? \tag{4.89} $$

4-5E. Compute the correlation coefficient for the rainfall in August-September
and October-December using the data presented in Table 2.1.


At several points in this textbook, we have encountered the normal distribution.
I have noted that the distribution is frequently encountered in applied
econometrics. Its centrality is due in part to its importance in sampling
theory. As we will develop in Chapter 6, the normal distribution is typically the
limiting distribution of sums or averages. In the vernacular of statistics, this
result is referred to as the Central Limit Theorem. However, normality
is important apart from its asymptotic nature. Specifically, since the sum
of normally distributed random variables is also normally distributed, we
sometimes invoke the normal distribution function in small samples. For
example, since a typical sample average is simply the sum of observations
divided by a constant (i.e., the sample size), the mean of a collection of normal
random variables is also normal. Extending this notion, since ordinary least
squares is simply a weighted sum, the normal distribution provides small
sample statistics for most regression applications.

The reasons for the importance of the normal distribution developed above 
are truly arguments that only a statistician or an applied econometrician could 
love. The normal distribution in the guise of the bell curve has spawned a 
plethora of books with titles such as Bell Curve: Intelligence and Class Struc- 
ture in American Life, Intelligence, Genes, and Success: Scientists Respond 
to the Bell Curve, and Poisoned Apple: The Bell-Curve Crisis and How Our 
Schools Create Mediocrity and Failure. Whether or not the authors of these
books have a firm understanding of the mathematics of the normal distribution,
the topic appeals to readers outside of a narrow band of specialists. The
popular appeal is probably due to the belief that the distribution of observable
characteristics is concentrated around some point, with extreme characteristics
being relatively rare. For example, an individual drawn randomly from a
sample of people would agree that height would follow a bell curve regardless
of their knowledge of statistics.

In this chapter we will develop the binomial distribution based on 
simple Bernoulli distributions. The binomial distribution provides a bridge 
between rather non-normal random variables (i.e., the X = 0 and X = 1 
Bernoulli random variable) and the normal distribution as the sample size be- 
comes large. In addition, we formally develop the bivariate and multivariate 
forms of the normal distribution. 


5.1 Bernoulli and Binomial Random Variables 


In Chapter 2, we developed the Bernoulli distribution to characterize a random
variable with two possible outcomes (i.e., whether a coin toss was a heads
(X = 1) or a tails (X = 0) or whether or not it would rain tomorrow). In
these cases the probability distribution function P[X] can be written as

$$ P[X] = p^x (1 - p)^{1 - x} \tag{5.1} $$

where p is the probability of the event occurring. Extending this basic
formulation slightly, consider the development of the probability of two
independent Bernoulli events. Suppose that we are interested in the outcome of
two coin tosses or whether it will rain two days in a row. Mathematically we
can specify this random variable as Z = X + Y. If X and Y are identically
distributed (both Bernoulli) and independent, the probability becomes

$$ \begin{aligned}
P[X, Y] = P[X] P[Y] &= p^x p^y (1 - p)^{1 - x} (1 - p)^{1 - y} \\
&= p^{x + y} (1 - p)^{2 - x - y}.
\end{aligned} \tag{5.2} $$

This density function is only concerned with three outcomes, Z = X + Y =
{0, 1, 2}. Notice that there is only one way each for Z = 0 or Z = 2.
Specifically, Z = 0 requires X = 0 and Y = 0. Similarly, Z = 2 requires X = 1
and Y = 1. However, for Z = 1, either X = 1 and Y = 0 or X = 0 and Y = 1.
Thus, we can derive

$$ \begin{aligned}
P[Z = 0] &= p^0 (1 - p)^{2 - 0} \\
P[Z = 1] &= P[X = 1, Y = 0] + P[X = 0, Y = 1] \\
&= p^{1 + 0} (1 - p)^{2 - 1 - 0} + p^{0 + 1} (1 - p)^{2 - 0 - 1} \\
&= 2 p^1 (1 - p)^1 \\
P[Z = 2] &= p^2 (1 - p)^0.
\end{aligned} \tag{5.3} $$

Next we expand the distribution to three independent Bernoulli events
where Z = W + X + Y = {0, 1, 2, 3}.

$$ \begin{aligned}
P[Z] &= P[W, X, Y] \\
&= P[W] P[X] P[Y] \\
&= p^w p^x p^y (1 - p)^{1 - w} (1 - p)^{1 - x} (1 - p)^{1 - y} \\
&= p^{w + x + y} (1 - p)^{3 - w - x - y} \\
&= p^z (1 - p)^{3 - z}.
\end{aligned} \tag{5.4} $$

Again, there is only one way for Z = 0 and Z = 3. However, there are now
three ways for Z = 1 or Z = 2. Specifically, Z = 1 if exactly one of W, X,
or Y equals one. In addition, Z = 2 if W = 1, X = 1, and Y = 0, or if
W = 1, X = 0, and Y = 1, or if W = 0, X = 1, and Y = 1. Thus the general
distribution function for Z can now be written as

$$ \begin{aligned}
P[Z = 0] &= p^0 (1 - p)^{3 - 0} \\
P[Z = 1] &= p^{1 + 0 + 0} (1 - p)^{3 - 1 - 0 - 0} + p^{0 + 1 + 0} (1 - p)^{3 - 0 - 1 - 0} + p^{0 + 0 + 1} (1 - p)^{3 - 0 - 0 - 1} = 3 p^1 (1 - p)^2 \\
P[Z = 2] &= p^{1 + 1 + 0} (1 - p)^{3 - 1 - 1 - 0} + p^{1 + 0 + 1} (1 - p)^{3 - 1 - 0 - 1} + p^{0 + 1 + 1} (1 - p)^{3 - 0 - 1 - 1} = 3 p^2 (1 - p)^1 \\
P[Z = 3] &= p^3 (1 - p)^0.
\end{aligned} \tag{5.5} $$


Based on our discussion of the Bernoulli distribution, the binomial distri- 
bution can be generalized as the sum of n Bernoulli events. 


Piz=r1=( 7% Joa-a" (5.6) 


where C7’ is the combinatorial of n and r developed in Defintion 2.14. Graph- 
ically, the combinatorial can be depicted as Pascal’s triangle in Figure 5.1. 
The relationship between the combinatorial and the binomial function can be 
developed through the general form of the polynomial. Consider the first four 
polynomial forms: 


(a+b)'=a+b 
(a+b)? =a? + 2ab + 0? 
(a+b)? =a? + 3a7b + 3ab? + 6 
(a+ )* = a4 + 4a3b + 6a7b? + 4ab? + b4. 


(5.7) 
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FIGURE 5.1 
Pascal’s Triangle. 


Based on this sequence, the general form of the polynomial form can then be 
written as 


(a+b)” = 5 > Cramer. (5.8) 
r=1 


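As a first step, consider changing the polynomial from a + b to a - b. Hence,
$(a-b)^{n}$ can be written as

$$
\begin{aligned}
(a-b)^{n} &= (a + (-1)b)^{n} \\
          &= \sum_{r=0}^{n} C_r^n a^{r} (-1)^{n-r} b^{n-r} \\
          &= \sum_{r=0}^{n} (-1)^{n-r} C_r^n a^{r} b^{n-r}.
\end{aligned}
\tag{5.9}
$$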
This sequence can be linked to our discussion of the Bernoulli system by
letting a = p and b = 1 - p. Thus, the binomial distribution X ~ B(n, p) can
be written as

$$
P(X = k) = C_k^n p^{k} (1-p)^{n-k}. \tag{5.10}
$$
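In Equation 5.5 above, n = 3. The distribution function can be written as

$$
\begin{aligned}
P[Z=0] &= \binom{3}{0} p^{0} (1-p)^{3-0} \\
P[Z=1] &= \binom{3}{1} p^{1} (1-p)^{3-1} \\
P[Z=2] &= \binom{3}{2} p^{2} (1-p)^{3-2} \\
P[Z=3] &= \binom{3}{3} p^{3} (1-p)^{3-3}.
\end{aligned}
\tag{5.11}
$$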
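As an illustrative check (a sketch, not part of the original derivation), the probabilities in Equation 5.11 can be recovered by enumerating every outcome of three independent Bernoulli draws and comparing the totals with the combinatorial form in Equation 5.10. The function names and the value p = 0.3 are assumptions chosen only for illustration.

```python
from itertools import product
from math import comb

def pmf_by_enumeration(n, p):
    """Add p^z (1-p)^(n-z) over every 0/1 outcome, grouped by z = number of ones."""
    probs = [0.0] * (n + 1)
    for outcome in product((0, 1), repeat=n):
        z = sum(outcome)
        probs[z] += p**z * (1 - p)**(n - z)
    return probs

def pmf_by_formula(n, p):
    """Equation 5.10: C(n, k) p^k (1-p)^(n-k) for k = 0, ..., n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

p = 0.3  # hypothetical success probability
enumerated = pmf_by_enumeration(3, p)
formula = pmf_by_formula(3, p)
for z, (a, b) in enumerate(zip(enumerated, formula)):
    print(z, round(a, 6), round(b, 6))
```

The two columns agree because the combinatorial simply counts how many of the enumerated outcomes share the same number of successes.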
Next, recalling Theorem 4.11 (E[aX + bY] = aE[X] + bE[Y]), the expectation
of the binomial distribution function can be recovered from the
Bernoulli distributions.

$$
E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} p = np. \tag{5.12}
$$


In addition, by Theorem 4.17 


$$
V\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} V(X_i). \tag{5.13}
$$


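Thus, the variance of the binomial is simply the sum of the variances of the
Bernoulli distributions or n times the variance of a single Bernoulli
distribution.

$$
\begin{aligned}
V(X_i) &= E[X_i^{2}] - (E[X_i])^{2} \\
       &= \left(0^{2}(1-p) + 1^{2}p\right) - p^{2} \\
       &= p - p^{2} = p(1-p).
\end{aligned}
\tag{5.14}
$$

Hence the variance of the binomial is np(1 - p).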
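The mean and variance results can be checked numerically by summing directly over the binomial probabilities. The sketch below is only an illustration; the helper name `binomial_moments` and the choice p = 0.5 are assumptions.

```python
from math import comb

def binomial_moments(n, p):
    """Mean and variance computed by summing over the binomial probabilities."""
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum((k - mean)**2 * q for k, q in enumerate(pmf))
    return mean, var

for n in (6, 9, 12):
    mean, var = binomial_moments(n, 0.5)
    # Compare with n*p and n*p*(1-p)
    print(n, mean, var, n * 0.5, n * 0.5 * 0.5)
```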


5.2 Univariate Normal Distribution 


In Section 3.5 we introduced the normal distribution function as 


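$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\frac{(x-\mu)^{2}}{\sigma^{2}}\right],
\quad -\infty < x < \infty,\ \sigma > 0. \tag{5.15}
$$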


In Section 3.7 we demonstrated that the standard normal form of this distri- 
bution written as 


$$
f(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^{2}}{2}} \tag{5.16}
$$


integrated to one. In addition, Equations 3.116 and 3.117 demonstrate how 
the more general form of the normal distribution in Equation 5.15 can be 
derived from Equation 5.16 by the change in variables technique. 
Theorem 5.1. Let X be $N(\mu, \sigma^{2})$ as defined in Equation 5.15; then $E[X] = \mu$
and $V[X] = \sigma^{2}$.


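Proof. Starting with the expectation of the normal function as presented in
Equation 5.15,

$$
E[X] = \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, x
       \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] dx. \tag{5.17}
$$

Using the change in variables technique, we create a new random variable z
such that

$$
z = \frac{x-\mu}{\sigma} \Rightarrow x = z\sigma + \mu, \qquad dx = \sigma\, dz. \tag{5.18}
$$

Substituting into the original integral yields

$$
\begin{aligned}
E[X] &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} (z\sigma + \mu)
        \exp\left[-\frac{1}{2}z^{2}\right] dz \\
     &= \sigma \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z \exp\left[-\frac{1}{2}z^{2}\right] dz
        + \mu \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left[-\frac{1}{2}z^{2}\right] dz.
\end{aligned}
\tag{5.19}
$$

Taking the integral of the first term first, we have

$$
\begin{aligned}
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, z \exp\left[-\frac{1}{2}z^{2}\right] dz
  &= C \int_{-\infty}^{\infty} z \exp\left[-\frac{1}{2}z^{2}\right] dz \\
  &= C \left(-\exp\left[-\frac{1}{2}z^{2}\right]\right)\Bigg|_{-\infty}^{\infty} = 0
\end{aligned}
\tag{5.20}
$$

with the constant $C = 1/\sqrt{2\pi}$. The value of the second integral becomes
$\mu$ by polar integration (see Section 3.7).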
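The variance of the normal is similarly defined except that the initial
integral now becomes, after applying the change of variables in Equation 5.18,

$$
V[X] = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} (x-\mu)^{2}
       \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] dx
     = \sigma^{2} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z^{2}
       \exp\left[-\frac{1}{2}z^{2}\right] dz. \tag{5.21}
$$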
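This formulation is then completed using integration by parts.

$$
\begin{aligned}
& u = -z \Rightarrow du = -dz \\
& dv = -z \exp\left[-\frac{1}{2}z^{2}\right] dz \Rightarrow v = \exp\left[-\frac{1}{2}z^{2}\right] \\
& \int_{-\infty}^{\infty} z^{2} \exp\left[-\frac{1}{2}z^{2}\right] dz
  = -\left(z \exp\left[-\frac{1}{2}z^{2}\right]\right)\Bigg|_{-\infty}^{\infty}
    + \int_{-\infty}^{\infty} \exp\left[-\frac{1}{2}z^{2}\right] dz.
\end{aligned}
\tag{5.22}
$$

The first term of the integration by parts is clearly zero, while the second is
defined by polar integral. Thus,

$$
V[X] = 0 + \sigma^{2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}
       \exp\left[-\frac{1}{2}z^{2}\right] dz = \sigma^{2}. \tag{5.23}
$$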


Theorem 5.2. Let X be distributed $N(\mu, \sigma^{2})$ and let $Y = \alpha + \beta X$. Then
$Y \sim N(\alpha + \beta\mu, \beta^{2}\sigma^{2})$.


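Proof. This theorem can be demonstrated using Theorem 3.33 (the theorem
on changes in variables).

$$
g(y) = f\left[\phi^{-1}(y)\right] \left|\frac{d\phi^{-1}(y)}{dy}\right|. \tag{5.24}
$$

In this case

$$
\phi(x) = \alpha + \beta x \Rightarrow \phi^{-1}(y) = \frac{y-\alpha}{\beta},
\qquad \frac{d\phi^{-1}(y)}{dy} = \frac{1}{\beta}. \tag{5.25}
$$

The transformed normal then becomes

$$
\begin{aligned}
g(y) &= \frac{1}{\sigma|\beta|\sqrt{2\pi}}
        \exp\left[-\frac{1}{2}\frac{\left(\frac{y-\alpha}{\beta} - \mu\right)^{2}}{\sigma^{2}}\right] \\
     &= \frac{1}{\sigma|\beta|\sqrt{2\pi}}
        \exp\left[-\frac{1}{2}\frac{(y-\alpha-\beta\mu)^{2}}{\sigma^{2}\beta^{2}}\right].
\end{aligned}
\tag{5.26}
$$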

Note that probabilities can be derived for any normal based on the
standard normal integral. Specifically, the probability that
X ~ N(10, 4) lies between 4 and 8 (P[4 < X < 8]) can be written as

$$
P[4 < X < 8] = P[X < 8] - P[X < 4]. \tag{5.27}
$$
Transforming each boundary to standard normal space,

$$
\begin{aligned}
z_1 &= \frac{x_1 - 10}{2} = \frac{-6}{2} = -3 \\
z_2 &= \frac{x_2 - 10}{2} = \frac{-2}{2} = -1.
\end{aligned}
\tag{5.28}
$$

Thus, the equivalent boundary becomes

$$
P[-3 < z < -1] = P[z < -1] - P[z < -3] \tag{5.29}
$$

where z is a standard normal variable. These values can be found in a standard
normal table as P[z < -1] = 0.1587 and P[z < -3] = 0.0013, so that
P[4 < X < 8] = 0.1587 - 0.0013 = 0.1574.
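The table lookup can also be sketched numerically. The example below evaluates the standard normal distribution function through the error function in Python's standard library; the helper name `std_normal_cdf` is an assumption for illustration, and the result should agree with the table values up to rounding.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Standard normal distribution function written in terms of erf."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma = 10.0, 2.0          # X ~ N(10, 4), so the standard deviation is 2
z_low = (4.0 - mu) / sigma     # lower boundary in standard normal space
z_high = (8.0 - mu) / sigma    # upper boundary in standard normal space
prob = std_normal_cdf(z_high) - std_normal_cdf(z_low)
print(z_low, z_high, round(prob, 4))
```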


5.3 Linking the Normal Distribution to the Binomial 


To develop the linkage between the normal and binomial distributions,
consider the probabilities for binomial distributions with 6, 9, and 12 draws
presented in columns two, three, and four in Table 5.1.

TABLE 5.1
Binomial Probabilities and Normalized Binomial Outcomes

Random          Probability             Normalized Outcome
Variable      6       9      12         6        9       12
    0       0.016   0.002   0.000    -2.449   -3.000   -3.464
    1       0.094   0.018   0.003    -1.633   -2.333   -2.887
    2       0.234   0.070   0.016    -0.816   -1.667   -2.309
    3       0.313   0.164   0.054     0.000   -1.000   -1.732
    4       0.234   0.246   0.121     0.816   -0.333   -1.155
    5       0.094   0.246   0.193     1.633    0.333   -0.577
    6       0.016   0.164   0.226     2.449    1.000    0.000
    7               0.070   0.193              1.667    0.577
    8               0.018   0.121              2.333    1.155
    9               0.002   0.054              3.000    1.732
   10                       0.016                       2.309
   11                       0.003                       2.887
   12                       0.000                       3.464
Mean        3.00    4.50    6.00
Variance    1.50    2.25    3.00

FIGURE 5.2
Comparison of Binomial and Normal Distribution.

To compare the shape of these distributions with the normal distribution,
we construct normalized outcomes for each set of outcomes.
$$
\tilde{z} = \frac{x - \mu^{1}_{x}}{\sqrt{\mu^{2}_{x} - \left(\mu^{1}_{x}\right)^{2}}} \tag{5.30}
$$

where $\mu^{1}_{x}$ and $\mu^{2}_{x}$ are the first and second moments of the binomial
distribution. Basically, Equation 5.30 is the level of the random variable minus the
theoretical mean divided by the square root of the theoretical variance. Thus,
taking the outcome of x = 2 for six draws as an example,

$$
\tilde{z} = \frac{2 - 3}{\sqrt{1.5}} = -0.816. \tag{5.31}
$$

We frequently use this procedure to normalize random variables to be
consistent with the standard normal. The sample equivalent to Equation 5.30 can
be expressed as

$$
\tilde{z} = \frac{x - \bar{x}}{\sqrt{s^{2}_{x}}} \tag{5.32}
$$

where $\bar{x}$ and $s^{2}_{x}$ are the sample mean and sample variance.

Figure 5.2 presents graphs of each distribution function and the standard
normal distribution (i.e., x ~ N[0, 1]). The results in Figure 5.2 demonstrate
that the distribution functions for normalized binomial variables converge rapidly
to the normal density function. For 12 draws, the two distributions are almost
identical.
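The comparison in Figure 5.2 can be approximated with a short calculation. The sketch below normalizes each binomial outcome as in Equation 5.30, rescales the probability by the standard deviation so that it is comparable with a density, and reports the largest gap from the standard normal density. The rescaling step and the function names are assumptions made for illustration; they are not taken from the text.

```python
from math import comb, exp, pi, sqrt

def normalized(k, n, p):
    """Equation 5.30: (outcome - theoretical mean) / sqrt(theoretical variance)."""
    return (k - n * p) / sqrt(n * p * (1 - p))

def max_gap(n, p):
    """Largest gap between the rescaled binomial probability and the
    standard normal density evaluated at the normalized outcome."""
    sd = sqrt(n * p * (1 - p))
    gaps = []
    for k in range(n + 1):
        prob = comb(n, k) * p**k * (1 - p)**(n - k)
        z = normalized(k, n, p)
        density = exp(-0.5 * z * z) / sqrt(2 * pi)
        gaps.append(abs(prob * sd - density))
    return max(gaps)

print(round(normalized(2, 6, 0.5), 3))
for n in (6, 9, 12):
    print(n, round(max_gap(n, 0.5), 4))
```

Under these assumptions the largest gap for 12 draws is smaller than the largest gap for 6 draws, which is consistent with the visual impression described above.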
This convergence to normality extends to other random variables. Table 5.2
presents 15 samples of eight uniform draws with the sum of the sample and
the normalized sum using the definition in Equation 5.30. Figure 5.3 presents
the distribution of the normalized random variable in Table 5.2 for sample
sizes of 100, 200, 400, and 800. These results demonstrate that as the sample
size increases, the empirical distribution of the normalized sum approaches
the standard normal distribution.

FIGURE 5.3
Limit of the Sum of Uniform Random Variables.
[Figure: vertical axis, Probability Density f(x); horizontal axis from -5 to 5.]
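A small simulation gives a feel for this result. The sketch below sums eight uniform draws, normalizes the sum with its theoretical mean (n/2) and variance (n/12), and reports the share of normalized sums falling inside plus or minus 1.96, which would be about 0.95 for a standard normal variable. The seed, the number of replications, and the function name are arbitrary assumptions.

```python
import random
from math import sqrt

def normalized_uniform_sum(n_draws, rng):
    """Sum n_draws U(0,1) variables, subtract the mean n/2, divide by sqrt(n/12)."""
    total = sum(rng.random() for _ in range(n_draws))
    return (total - n_draws / 2.0) / sqrt(n_draws / 12.0)

rng = random.Random(12345)  # arbitrary seed for reproducibility
sample = [normalized_uniform_sum(8, rng) for _ in range(2000)]
share = sum(abs(z) < 1.96 for z in sample) / len(sample)
print(round(share, 3))
```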

It is important to notice that the limiting results behind Figures 5.2 and 5.3 
are similar, but somewhat different. The limiting result for the binomial dis- 
tribution presented in Figure 5.2 is due to a direct relationship between the 
binomial and the normal distribution. The limiting result for the sum of uni- 
form random variables in Figure 5.3 is the result of the central limit theorem 
developed in Section 6.5. 


5.4 Bivariate and Multivariate Normal Random Variables


In addition to its limiting characteristic, the normal density function provides 
for a simple way to conceptualize correlation between random variables. To 
develop this specification, we start with the bivariate form of the normal distri- 
bution and then expand the bivariate formulation to the general multivariate 
form. 
5.4.1 Bivariate Normal Random Variables


In order to develop the general form of the bivariate normal distribution, 
consider a slightly expanded form of a normal random variable: 


$$
f\left(x \mid \mu, \sigma^{2}\right) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right]. \tag{5.33}
$$


The left-hand side of Equation 5.33 explicitly recognizes that the general form
of the normal distribution function is conditioned on two parameters: the
mean of the distribution $\mu$ and the variance of the distribution $\sigma^{2}$. Given this
specification, consider a bivariate normal distribution function where the two
random variables are independent.

$$
f\left(x, y \mid \mu_x, \mu_y, \sigma_x^{2}, \sigma_y^{2}\right) = \frac{1}{2\pi\sigma_x\sigma_y}
  \exp\left[-\frac{(x-\mu_x)^{2}}{2\sigma_x^{2}} - \frac{(y-\mu_y)^{2}}{2\sigma_y^{2}}\right]. \tag{5.34}
$$


Equation 5.34 can be easily factored into forms similar to Equation 5.33.
Definition 5.3 builds on Equation 5.34 by introducing a coefficient that
controls the correlation between the two random variables ($\rho$).


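Definition 5.3. The bivariate normal density is defined by

$$
\begin{aligned}
f\left(x, y \mid \mu_x, \mu_y, \sigma_x, \sigma_y, \rho\right) = {}& \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^{2}}} \\
& \times \exp\left\{-\frac{1}{2(1-\rho^{2})}\left[\left(\frac{x-\mu_x}{\sigma_x}\right)^{2}
  + \left(\frac{y-\mu_y}{\sigma_y}\right)^{2}
  - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)\right]\right\}.
\end{aligned}
\tag{5.35}
$$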


Theorem 5.4 develops some of the conditional moments of the bivariate normal 
distribution presented in Equation 5.35. 


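Theorem 5.4. Let (X, Y) have the bivariate normal density. Then the
marginal densities $f_X(X)$ and $f_Y(Y)$ and the conditional densities $f(Y|X)$
and $f(X|Y)$ are univariate normal densities, and we have $E[X] = \mu_x$,
$E[Y] = \mu_y$, $V[X] = \sigma_x^{2}$, $V[Y] = \sigma_y^{2}$, $\text{Corr}(X, Y) = \rho$, and

$$
\begin{aligned}
E[Y|X] &= \mu_y + \rho\frac{\sigma_y}{\sigma_x}(X - \mu_x) \\
V[Y|X] &= \sigma_y^{2}\left(1-\rho^{2}\right).
\end{aligned}
\tag{5.36}
$$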


Proof. Let us start by factoring $(x - \mu_x)/\sigma_x$ out of the exponent term in
Equation 5.35. Specifically, suppose that we solve for K such that

$$
-\frac{1}{2(1-\rho^{2})}\left(\frac{x-\mu_x}{\sigma_x}\right)^{2}
  = -\frac{1}{2}\left(\frac{x-\mu_x}{\sigma_x}\right)^{2} + K\left(\frac{x-\mu_x}{\sigma_x}\right)^{2}. \tag{5.37}
$$

Technically, this solution technique is referred to as the method of unknown
coefficients: we introduce an unknown coefficient, in this case K, and attempt
to solve for the original expression. Notice that $(x - \mu_x)/\sigma_x$ is irrelevant. The
problem becomes to solve for K such that

$$
-\frac{1}{2(1-\rho^{2})} = -\frac{1}{2} + K
\Rightarrow K = \frac{1}{2} - \frac{1}{2(1-\rho^{2})}
  = \frac{(1-\rho^{2}) - 1}{2(1-\rho^{2})} = -\frac{\rho^{2}}{2(1-\rho^{2})}. \tag{5.38}
$$


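Therefore we can rewrite Equation 5.37 as

$$
-\frac{1}{2(1-\rho^{2})}\left(\frac{x-\mu_x}{\sigma_x}\right)^{2}
  = -\frac{1}{2}\left(\frac{x-\mu_x}{\sigma_x}\right)^{2}
    - \frac{\rho^{2}}{2(1-\rho^{2})}\left(\frac{x-\mu_x}{\sigma_x}\right)^{2}. \tag{5.39}
$$

Substituting the result of Equation 5.39 into the exponent term in
Equation 5.35 yields

$$
-\frac{1}{2(1-\rho^{2})}\left[\rho^{2}\left(\frac{x-\mu_x}{\sigma_x}\right)^{2}
  + \left(\frac{y-\mu_y}{\sigma_y}\right)^{2}
  - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)\right]
  - \frac{1}{2}\left(\frac{x-\mu_x}{\sigma_x}\right)^{2}. \tag{5.40}
$$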
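We can rewrite the first term (in brackets) of Equation 5.40 as a square.

$$
\rho^{2}\left(\frac{x-\mu_x}{\sigma_x}\right)^{2} + \left(\frac{y-\mu_y}{\sigma_y}\right)^{2}
  - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right)
  = \left[\frac{y-\mu_y}{\sigma_y} - \rho\frac{x-\mu_x}{\sigma_x}\right]^{2}. \tag{5.41}
$$

Multiplying the last result in Equation 5.41 by $\sigma_y^{2}/\sigma_y^{2}$ yields

$$
\frac{1}{\sigma_y^{2}}\left[y - \mu_y - \rho\frac{\sigma_y}{\sigma_x}(x - \mu_x)\right]^{2}. \tag{5.42}
$$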


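Hence the density function in Equation 5.35 can be rewritten as

$$
\begin{aligned}
f = {}& \frac{1}{\sigma_y\sqrt{2\pi}\sqrt{1-\rho^{2}}}
  \exp\left\{-\frac{1}{2\sigma_y^{2}(1-\rho^{2})}
  \left[y - \mu_y - \rho\frac{\sigma_y}{\sigma_x}(x - \mu_x)\right]^{2}\right\} \\
& \times \frac{1}{\sigma_x\sqrt{2\pi}} \exp\left[-\frac{1}{2\sigma_x^{2}}(x-\mu_x)^{2}\right]
  = f_{2|1}\, f_1
\end{aligned}
\tag{5.43}
$$

where $f_1$ is the density of $N(\mu_x, \sigma_x^{2})$ and $f_{2|1}$ is the conditional density
function of $N\left(\mu_y + \rho\sigma_y\sigma_x^{-1}(x - \mu_x),\ \sigma_y^{2}(1-\rho^{2})\right)$. To complete the
proof, start by integrating the joint density with respect to y.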


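$$
\begin{aligned}
f(x) &= \int_{-\infty}^{\infty} f_1 f_{2|1}\, dy \\
     &= f_1 \int_{-\infty}^{\infty} f_{2|1}\, dy \\
     &= f_1.
\end{aligned}
\tag{5.44}
$$

This gives us $x \sim N(\mu_x, \sigma_x^{2})$. Next, we have

$$
f(y|x) = \frac{f(x, y)}{f(x)} = \frac{f_{2|1} f_1}{f_1} = f_{2|1} \tag{5.45}
$$

which proves the conditional relationship.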
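By Theorem 4.23 (Law of Iterated Means), $E[\phi(X, Y)] = E_X E_{Y|X}[\phi(X, Y)]$
where $E_X$ denotes the expectation with respect to X.

$$
\begin{aligned}
E[XY] &= E_X E[XY|X] = E_X\left[X E[Y|X]\right] \\
      &= E_X\left[X\mu_y + \rho\frac{\sigma_y}{\sigma_x} X (X - \mu_x)\right] \\
      &= \mu_x\mu_y + \rho\sigma_x\sigma_y.
\end{aligned}
\tag{5.46}
$$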


Also notice that if the random variables are uncorrelated then $\rho = 0$ and by
Equation 5.46 $E[XY] = \mu_x\mu_y$.
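Equation 5.46 and the conditional mean in Equation 5.36 can be illustrated by simulation. The sketch below builds correlated normal pairs from two independent standard normal draws (a standard construction, used here as an assumption rather than the author's method) and compares the sample average of XY and the sample regression slope with their theoretical counterparts. All parameter values and the seed are hypothetical.

```python
import random
from math import sqrt

mu_x, mu_y, sd_x, sd_y, rho = 1.0, 2.0, 2.0, 3.0, 0.5  # hypothetical parameters
rng = random.Random(2024)  # arbitrary seed
xs, ys = [], []
for _ in range(50000):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    xs.append(mu_x + sd_x * z1)
    # Mixing z1 and z2 gives Corr(X, Y) = rho
    ys.append(mu_y + sd_y * (rho * z1 + sqrt(1 - rho**2) * z2))

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
cov_xy = mean_xy - mean_x * mean_y
var_x = sum((x - mean_x)**2 for x in xs) / n
slope = cov_xy / var_x  # sample analog of rho * sd_y / sd_x

print(round(mean_xy, 2), mu_x * mu_y + rho * sd_x * sd_y)
print(round(slope, 3), rho * sd_y / sd_x)
```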

Following some of the results introduced in Chapter 4, a linear sum of 
normal random variables is normal, as presented in Theorem 5.5. 


Theorem 5.5. If X and Y are bivariate normal and $\alpha$ and $\beta$ are constants,
then $\alpha X + \beta Y$ is normal.


The mean of a sample of pairwise independent normal random variables is also
normally distributed with a variance of $\sigma^{2}/N$, as depicted in Theorem 5.6.


Theorem 5.6. Let $\{X_i\}$, $i = 1, 2, \ldots, N$ be pairwise independent and
identically distributed random variables where the original random variables are
distributed $N(\mu, \sigma^{2})$. Then $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} X_i$ is $N(\mu, \sigma^{2}/N)$.

And, finally, if two jointly normally distributed random variables are
uncorrelated, then they are independent.


Theorem 5.7. If X and Y are bivariate normal and Cov (X,Y) = 0, then 
X and Y are independent. 
5.4.2 Multivariate Normal Distribution


Expansion of the normal distribution function to more than two variables 
requires some matrix concepts introduced in Chapter 10. However, we start 
with a basic introduction of the multivariate normal distribution to build on 
the basic concepts introduced in our discussion of the bivariate distribution. 


Definition 5.8. We say X is multivariate normal with mean $\mu$ and
variance-covariance matrix $\Sigma$, denoted $N(\mu, \Sigma)$, if its density is given by

$$
f(x) = (2\pi)^{-n/2}\, |\Sigma|^{-1/2} \exp\left[-\frac{1}{2}(x-\mu)'\,\Sigma^{-1}(x-\mu)\right]. \tag{5.47}
$$


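Note first that $|\Sigma|$ denotes the determinant of the variance matrix (discussed in
Section 10.1.1.5). For our current purposes, we simply define the determinant
of the 2 x 2 matrix as

$$
|\Sigma| = \begin{vmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{vmatrix}
  = \sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21}. \tag{5.48}
$$

Given that the variance matrix is symmetric (i.e., $\sigma_{12} = \sigma_{21}$), we could write
$|\Sigma| = \sigma_{11}\sigma_{22} - \sigma_{12}^{2}$.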

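The inverse of the variance matrix (i.e., $\Sigma^{-1}$ in Equation 5.47) is a little
more complex. We will first invert the matrix where the coefficients are
unknown scalars (i.e., single numbers) by row reduction.

$$
\begin{aligned}
\left[\begin{array}{cc|cc}
\sigma_{11} & \sigma_{12} & 1 & 0 \\
\sigma_{21} & \sigma_{22} & 0 & 1
\end{array}\right]
&\rightarrow
\left[\begin{array}{cc|cc}
\sigma_{11} & \sigma_{12} & 1 & 0 \\
0 & \sigma_{22} - \dfrac{\sigma_{12}\sigma_{21}}{\sigma_{11}} & -\dfrac{\sigma_{21}}{\sigma_{11}} & 1
\end{array}\right] \\
&\rightarrow
\left[\begin{array}{cc|cc}
\sigma_{11} & \sigma_{12} & 1 & 0 \\
0 & 1 & -\dfrac{\sigma_{21}}{\sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21}} &
        \dfrac{\sigma_{11}}{\sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21}}
\end{array}\right] \\
&\rightarrow
\left[\begin{array}{cc|cc}
1 & 0 & \dfrac{1}{\sigma_{11}} + \dfrac{\sigma_{12}\sigma_{21}}{\sigma_{11}(\sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21})} &
        -\dfrac{\sigma_{12}}{\sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21}} \\
0 & 1 & -\dfrac{\sigma_{21}}{\sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21}} &
        \dfrac{\sigma_{11}}{\sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21}}
\end{array}\right].
\end{aligned}
\tag{5.49}
$$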
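Focusing on the first term in the inverse,

$$
\begin{aligned}
\frac{1}{\sigma_{11}} + \frac{\sigma_{12}\sigma_{21}}{\sigma_{11}(\sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21})}
  &= \frac{\sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21} + \sigma_{12}\sigma_{21}}{\sigma_{11}(\sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21})} \\
  &= \frac{\sigma_{11}\sigma_{22}}{\sigma_{11}(\sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21})}
   = \frac{\sigma_{22}}{\sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21}}.
\end{aligned}
\tag{5.50}
$$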
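The scalar result can be verified with a small numerical example. The sketch below inverts a hypothetical 2 x 2 variance matrix using the closed form implied by Equations 5.49 and 5.50 and confirms that the product with the original matrix is the identity. The matrix values and the function names are assumptions for illustration.

```python
def inverse_2x2(s11, s12, s21, s22):
    """Closed-form inverse of [[s11, s12], [s21, s22]] using its determinant."""
    det = s11 * s22 - s12 * s21
    if det == 0:
        raise ValueError("matrix is singular")
    return [[s22 / det, -s12 / det], [-s21 / det, s11 / det]]

def matmul_2x2(a, b):
    """Product of two 2 x 2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

sigma = [[4.0, 1.2], [1.2, 9.0]]  # hypothetical variance matrix
inv = inverse_2x2(4.0, 1.2, 1.2, 9.0)
identity = matmul_2x2(sigma, inv)

# First term of the inverse before simplification (left side of Equation 5.50)
first_term = 1 / 4.0 + (1.2 * 1.2) / (4.0 * (4.0 * 9.0 - 1.2 * 1.2))
print(identity)
print(first_term, inv[0][0])
```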
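Next, consider inverting a matrix of matrices by row reduction following
the same approach used in inverting the matrix of scalars.

$$
\begin{aligned}
\left[\begin{array}{cc|cc}
\Sigma_{XX} & \Sigma_{XY} & I & 0 \\
\Sigma_{YX} & \Sigma_{YY} & 0 & I
\end{array}\right]
&\rightarrow
\left[\begin{array}{cc|cc}
I & \Sigma_{XX}^{-1}\Sigma_{XY} & \Sigma_{XX}^{-1} & 0 \\
\Sigma_{YX} & \Sigma_{YY} & 0 & I
\end{array}\right] \\
&\rightarrow
\left[\begin{array}{cc|cc}
I & \Sigma_{XX}^{-1}\Sigma_{XY} & \Sigma_{XX}^{-1} & 0 \\
0 & \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} & -\Sigma_{YX}\Sigma_{XX}^{-1} & I
\end{array}\right].
\end{aligned}
\tag{5.51}
$$

Multiplying the last row of Equation 5.51 by $\left(\Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\right)^{-1}$ and
then subtracting $\Sigma_{XX}^{-1}\Sigma_{XY}$ times the second row from the first row yields

$$
\begin{bmatrix}
\Sigma_{XX}^{-1} + \Sigma_{XX}^{-1}\Sigma_{XY}\left(\Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\right)^{-1}\Sigma_{YX}\Sigma_{XX}^{-1} &
-\Sigma_{XX}^{-1}\Sigma_{XY}\left(\Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\right)^{-1} \\
-\left(\Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\right)^{-1}\Sigma_{YX}\Sigma_{XX}^{-1} &
\left(\Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\right)^{-1}
\end{bmatrix}.
\tag{5.52}
$$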

As unwieldy as the result in Equation 5.52 appears, it yields several useful
implications. For example, suppose we were interested in the matrix
relationship

$$
y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
  = \begin{bmatrix} I & B \\ 0 & I \end{bmatrix}
    \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{5.53}
$$

so that $y_1 = x_1 + Bx_2$ and $y_2 = x_2$. Notice that the equation leaves open the
possibility that both $y_1$ and $y_2$ are vectors and B is a matrix.


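Next, assume that we want to derive the value of B so that $y_1$ is
uncorrelated with $y_2$ (i.e., $E[(y_1 - E[y_1])(y_2 - E[y_2])'] = 0$). Therefore,

$$
\begin{aligned}
E\left[(x_1 + Bx_2 - \mu_1 - B\mu_2)(x_2 - \mu_2)'\right] &= 0 \\
E\left[\{(x_1 - \mu_1) + B(x_2 - \mu_2)\}(x_2 - \mu_2)'\right] &= 0 \\
E\left[(x_1 - \mu_1)(x_2 - \mu_2)' + B(x_2 - \mu_2)(x_2 - \mu_2)'\right] &= 0 \\
\Sigma_{12} + B\Sigma_{22} &= 0 \\
\Rightarrow B = -\Sigma_{12}\Sigma_{22}^{-1} \text{ and } y_1 &= x_1 - \Sigma_{12}\Sigma_{22}^{-1}x_2.
\end{aligned}
\tag{5.54}
$$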
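For the scalar case, the result in Equation 5.54 can be checked with simple arithmetic. The sketch below uses hypothetical variances and a hypothetical covariance, sets B to minus the covariance divided by the second variance, and confirms that the covariance between the transformed variables is zero. It also shows that the variance of the transformed first variable equals the first variance minus the squared covariance divided by the second variance.

```python
s11, s12, s22 = 4.0, 1.2, 9.0  # hypothetical V[x1], Cov[x1, x2], V[x2]

B = -s12 / s22                            # scalar version of B in Equation 5.54
cov_y1_y2 = s12 + B * s22                 # Cov[x1 + B*x2, x2]
var_y1 = s11 + 2 * B * s12 + B**2 * s22   # V[x1 + B*x2]
reduced_var = s11 - s12**2 / s22          # s11 - s12 * s22^(-1) * s21

print(B, cov_y1_y2)
print(var_y1, reduced_var)
```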
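Not to get too far ahead of ourselves, but $\Sigma_{12}\Sigma_{22}^{-1}$ is the general form of
the regression coefficient presented in Equation 4.50. Substituting the result
from Equation 5.54 yields a general form for the conditional expectation and
conditional variance of the normal distribution.

$$
\begin{aligned}
\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
  &= \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix}
     \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \\
E\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
  &= \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix}
     \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}
   = \begin{pmatrix} \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}\mu_2 \\ \mu_2 \end{pmatrix} \\
V\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
  &= \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}.
\end{aligned}
\tag{5.55}
$$


5.5 Chapter Summary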


• The normal distribution is the foundation of a wide variety of econometric
  applications.

  - The Central Limit Theorem (developed in Chapter 6) depicts how the
    sum of random variables (and hence their sample averages) will be
    approximately normally distributed in large samples regardless of the
    original distribution of random variables.

  - If we assume a random variable is normally distributed in small
    samples, we know the small sample distribution of a variety of sample
    statistics such as the mean, variance, and regression coefficients. As
    will be developed in Chapter 7, the mean and regression coefficients
    follow the Student's t distribution while the variance follows a
    chi-squared distribution.

• The multivariate normal distribution provides for the analysis of
  relationships between random variables within the distribution function.


5.6 Review Questions

5-1R. Derive the general form of the normal variable from the standard
      normal using a change in variables technique.

5-2R. In Theorem 5.4, rewrite the conditional expectation in terms of the
      regression $\beta$ (i.e., $\beta = \text{Cov}[XY]/V[X]$).

5-3R. Prove the variance portion of Theorem 5.6.
5.7 Numerical Exercises

5-1E. Derive the absolute approximation error for 6, 9, 12, and 15. Draw
      binomial variables and compute the percent absolute deviation between
      the binomial variable and a standard normal random variable. Does this
      absolute deviation decline?

5-2E. Given that a random variable is distributed normally with a mean of
      5 and a variance of 6, what is the probability that the outcome will be
      less than zero?

5-3E. Construct a histogram for 10 sums of 10 Bernoulli draws (normalized
      by subtracting their means and dividing by their theoretical standard
      deviations). Compare this histogram with a histogram of 20 sums of
      10 Bernoulli draws normalized in the same manner. Compare each
      histogram with the standard normal distribution. Does the large sample
      approach the standard normal?

5-4E. Compute the sample covariance matrix for the farm interest rate for
      Alabama and the Baa Corporate bond rate using the data presented
      in Appendix D. What is the correlation coefficient between these two
      series?

5-5E. Appendix D presents the interest rate and the change in debt to asset
      ratio in the southeastern United States for 1960 through 2003 as well
      as the interest rate on Baa Corporate bonds from the St. Louis Federal
      Reserve Bank. The covariance matrix for interest rates in Florida, the
      change in debt to asset ratio for Florida, and the Baa Corporate bond
      rate is

$$
S = \begin{bmatrix}
 0.0002466 & -0.0002140 &  0.00031270 \\
-0.0002140 &  0.0032195 & -0.00014738 \\
 0.0003127 & -0.00014738 & 0.00065745
\end{bmatrix}. \tag{5.56}
$$

      Compute the projected variance of the interest rate for Florida
      conditional on Florida's change in debt to asset ratio and the Baa
      Corporate bond rate.


Part II 


Estimation 
In this chapter we want to develop the foundations of sample theory. First
assume that we want to make an inference, either estimation or some test, 
based on a sample. We are interested in how well parameters or statistics based 
on that sample represent the parameters or statistics of the whole population. 




6.1 Convergence of Statistics 


In statistical terms, we want to develop the concept of convergence.
Specifically, we are interested in whether or not the statistics calculated on the
sample converge toward the population values. Let $\{X_n\}$ be a sequence of samples.
We want to demonstrate that statistics based on $\{X_n\}$ converge toward the
population statistics for X.
FIGURE 6.1
Probability Density Function for the Sample Mean.


As a starting point, consider a simple estimator of the first four central 
moments of the standard normal distribution. 


(6.1) 


Given the expression for the standard normal distribution developed in
Chapter 5, the theoretical first four moments are $\mu_1(x) = 0$, $\mu_2(x) = 1$, $\mu_3(x) = 0$,
and $\mu_4(x) = 3$. Figure 6.1 presents the empirical probability density functions
for the first empirical moment (the sample mean) defined in Equation 6.1 for
sample sizes of 50, 100, 200, 400, 800, 1600, and 3200. As depicted in
Figure 6.1, the probability density function of the sample mean concentrates (or
converges) around its theoretical value as the sample size increases.
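The behavior described here can be sketched with a short simulation. The estimator below divides by the sample size and centers the second through fourth moments on the sample mean; that normalization is an assumption made for illustration and may differ from the estimator the text has in mind. The seed and sample sizes are also arbitrary.

```python
import random

def sample_moments(xs):
    """Sample mean followed by the second, third, and fourth sample central moments."""
    n = len(xs)
    m1 = sum(xs) / n
    central = [sum((x - m1)**k for x in xs) / n for k in (2, 3, 4)]
    return [m1] + central

rng = random.Random(7)  # arbitrary seed
for n in (50, 800, 12800):
    draws = [rng.gauss(0, 1) for _ in range(n)]
    # Theoretical values for the standard normal are 0, 1, 0, and 3
    print(n, [round(m, 3) for m in sample_moments(draws)])
```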
Another way of looking at the concept of convergence involves examining
the quantiles of the distribution of the sample moments. The quantile is a
measure that asks what level of random variable X yields a probability of a
value less than p. Mathematically, the quantile is defined as

$$
x^{*}(p) \Rightarrow \int_{-\infty}^{x^{*}(p)} f(z)\, dz = p \tag{6.2}
$$

where f(z) is a valid density function. If p = {0, 0.25, 0.50, 0.75, 1.0}, this
measure defines the quartiles of the distribution. Table 6.1 presents the


TABLE 6.1
Quartiles for Sample Moments of the Standard Normal

Sample                  Quartiles                    Interquartile Range
Size      0%      25%      50%      75%     100%     Raw    Normalized
Mean — First Moment 
50 —0.3552 —0.0979 —0.0128 0.0843 0.2630 0.1821 
100 —0.3485 —0.0586 0.0012 0.0622 0.2028 0.1208 
200 —0.2264 —0.0510 —0.0028 0.0377 0.1412 0.0888 
400 —0.1330 —0.0323 0.0010 0.0439 0.1283 0.0762 
800 —0.0991 —0.0172 0.0052 0.0259 0.0701 0.0431 
1600 —0.0622 —0.0220 0.0029 0.0213 0.0620 0.0433 
3200  —0.0425 —0.0164 —0.0003 0.0110 0.0397 0.0274 
Variance — Second Central Moment 
50 0.5105 0.8676 0.9481 1.0918 1.5562 0.2242 
100 0.7115 0.8947 0.9862 1.0927 1.3715 0.1980 
200 0.8024 0.9248 0.9971 1.0857 1.3140 0.1609 
400 0.8695 0.9491 0.9966 1.0419 1.2266 0.0928 
800 0.9160 0.9540 0.9943 1.0287 1.1443 0.0747 
1600 0.9321 0.9771 1.0047 1.0267 1.0933 0.0495 
3200 0.9414 0.9856 0.9978 1.0169 1.0674 0.0313 
Skewness — Third Central Moment 
50 —0.7295 —0.1412 —0.0333 0.1582 0.8134 0.2994 
100 —0.6869 —0.1674 —0.0008 0.1646 0.7121 0.3320 
200 —0.4252 —0.0968 —0.0116 0.1392 0.4867 0.2361 
400 —0.3163 —0.0797 —0.0036 0.0703 0.3953 0.1500 
800 —0.2185 —0.0601 —0.0133 0.0605 0.1841 0.1205 
1600 —0.1397 —0.0428 —0.0011 0.0448 0.1356 0.0877 
3200 —0.1288 —0.0208 0.0058 0.0285 0.1068 0.0493 
Kurtosis — Fourth Central Moment 
50 0.7746 2.0197 2.6229 3.3221 7.9826 1.3025 0.4342 
100 1.3950 2.1279 2.7092 3.4366 6.1016 1.3086 0.4362 
200 1.5611 2.4127 2.8683 3.5025 5.0498 1.0898 0.3633 
400 1.9865 2.6823 2.8981 3.3730 4.7347 0.6907 0.2302 
800 2.4249 2.7026 3.0089 3.2071 3.9311 0.5045 0.1682 
1600 2.5575 2.8189 2.9881 3.2241 4.0558 0.4052 0.1351 
3200 2.6229 2.9042 3.0351 3.1240 3.6167 0.2198 0.0733 
quartiles for each sample statistic and sample size. Next, consider the
interquartile range defined as

$$
R(p) = x^{*}(0.75) - x^{*}(0.25). \tag{6.3}
$$

Examining the quartile results in Table 6.1, notice that the true value of the
moment is contained in the quartile range. In addition, the quartile range
declines as the sample size increases. Notice that the quartile range for the
fourth moment is normalized by its theoretical value. However, even with
this adjustment, the values in Table 6.1 indicate that higher order moments
converge less rapidly than lower order moments (i.e., kurtosis converges more
slowly than the mean).
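The shrinking interquartile range can be illustrated for the sample mean. The sketch below draws repeated samples from the standard normal, computes the sample mean of each, and reports the distance between the 25 percent and 75 percent quartiles of those means. The number of replications, the seed, and the function name are assumptions, so the values will not reproduce Table 6.1.

```python
import random
from statistics import quantiles

def iqr_of_sample_mean(n, reps, rng):
    """Interquartile range of the sample mean across reps samples of size n."""
    means = [sum(rng.gauss(0, 1) for _ in range(n)) / n for _ in range(reps)]
    q25, _, q75 = quantiles(means, n=4)  # three cut points: 25%, 50%, 75%
    return q75 - q25

rng = random.Random(11)  # arbitrary seed
ranges = {n: iqr_of_sample_mean(n, 200, rng) for n in (50, 800)}
for n, width in ranges.items():
    print(n, round(width, 4))
```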

Taking a slightly different tack, the classical assumptions for ordinary least 
squares (OLS) as presented in White [53] are identified by Theorem 6.1. 


Theorem 6.1. The following are the assumptions of the classical linear
model.

(i) The model is known to be $y = X\beta + \epsilon$, $\beta < \infty$.
(ii) X is a nonstochastic and finite n x k matrix.
(iii) X'X is nonsingular for all n > k.
(iv) $E(\epsilon) = 0$.
(v) $\epsilon \sim N(0, \sigma_0^{2} I)$, $\sigma_0^{2} < \infty$.

Given these assumptions, we can conclude that

a) Existence: given (i) through (iii), $\hat{\beta}_n$ exists for all n > k and is unique.
b) Unbiasedness: given (i) through (v), $E[\hat{\beta}_n] = \beta_0$.
c) Normality: given (i) through (v), $\hat{\beta}_n \sim N\left(\beta_0, \sigma_0^{2}(X'X)^{-1}\right)$.
d) Efficiency: given (i) through (v), $\hat{\beta}_n$ is the maximum likelihood estimator and
   the best unbiased estimator in the sense that the variance of any other
   unbiased estimator exceeds that of $\hat{\beta}_n$ by a positive semi-definite matrix
   regardless of the value of $\beta_0$.


Existence, unbiasedness, normality, and efficiency are small sample analogs
of asymptotic theory. Unbiasedness implies that the distribution of $\hat{\beta}_n$ is centered
around $\beta_0$. Normality allows us to construct t-distribution or F-distribution
tests for restrictions. Efficiency guarantees that the ordinary least squares
estimates have the greatest possible precision.

Asymptotic theory involves the behavior of the estimator under the failure
of certain assumptions, specifically assumptions (ii) or (v). The possible
failure of assumption (ii) depends on the ability of the econometrician to control
the sample. Specifically, the question is whether to control the experiment by
designing or selecting the levels of exogenous variables. The alternative is the 
conjecture that the econometrician observed a sample of convenience. In 
other words, the economy generated the sample through its normal operation. 
Under this scenario, the X matrix was in fact stochastic or random and the 
researcher simply observes one possible outcome. 

Failure of assumption (v) also pervades economic applications. Specifically,
whether the residual or error is normally distributed may be one of the
primary testable hypotheses in the study. For example, Moss and Shonkwiler
[33] found that corn yields were non-normally distributed. Further, the fact
that the yields were non-normally distributed was an important finding of the
study because it affected the pricing of crop insurance contracts.

The potential non-normality of the error term is important for the classical
linear model because normality of the error term is required to strictly apply
t-distributions or F-distributions. However, the central limit theorem can be
used if n is large enough to guarantee that $\hat{\beta}_n$ is approximately normal.

Given that the data collected by the researcher conforms to the classical
linear model, the estimated coefficients are unbiased and hypothesis tests on
the results are correct (i.e., the parameters are distributed t and linear
combinations of the parameters are distributed F). However, if the data fails to
meet the classical assumptions, the estimated parameters may still converge to
their true values as the sample size becomes large. In addition, we are interested
in modifying the statistical test of significance for the model's parameters.


6.2 Modes of Convergence 


As a starting point, consider a simple mathematical model of convergence for
a nonrandom variable. Most students have seen the basic proof that

$$
\lim_{x \to \infty} \frac{1}{x} = 0 \tag{6.4}
$$

which basically implies that as x becomes infinitely large, the function value
f(x) = 1/x can be made arbitrarily close to zero. To put a little more
mathematical rigor on the concept, let us define $\delta$ and $\epsilon$ such that

$$
f(x + \delta) = \frac{1}{x + \delta} \le \epsilon. \tag{6.5}
$$

Convergence implies that for any $\epsilon$ there exists a $\delta$ that meets the criteria
in Equation 6.5. The basic concept in Equation 6.5 is typically used in
calculus courses to define derivatives. In this textbook we are interested in a
slightly different formulation: the limit of a sequence of numbers as defined
in Definition 6.2.
Definition 6.2. A sequence of real numbers $\{a_n\}$, n = 1, 2, ... is said to
converge to a real number a if for any $\epsilon > 0$ there exists an integer N such
that for all n > N we have

$$
|a_n - a| < \epsilon. \tag{6.6}
$$

This convergence is expressed $a_n \to a$ as $n \to \infty$ or $\lim_{n \to \infty} a_n = a$.
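Definition 6.2 can be made concrete with the sequence 1/n, which converges to zero. The sketch below returns, for a given tolerance, one integer N that satisfies the definition; the function name and the tolerances are assumptions chosen for illustration.

```python
from math import ceil

def index_for(epsilon):
    """Return an integer N such that |1/n - 0| < epsilon for every n > N."""
    return ceil(1.0 / epsilon)

for eps in (0.5, 0.1, 0.01):
    N = index_for(eps)
    # Check the first term beyond N; later terms are smaller still
    print(eps, N, abs(1.0 / (N + 1) - 0.0) < eps)
```

Because 1/n is decreasing, verifying the first term beyond N is enough to confirm the inequality for all larger n.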

This definition must be changed for random variables because we cannot
require a random variable to approach a specific value. Instead, we require
the probability of the variable to approach a given value. Specifically, we want
the probability of the event to equal 1 or zero as n goes to infinity. This
concept defines three different modes of convergence. First,
convergence in probability implies that the distance between the sample
value and the true value (i.e., the absolute difference) can be made small.


Definition 6.3 (Convergence in Probability). A sequence of random variables
$\{X_n\}$, n = 1, 2, ... is said to converge to a random variable X in probability
if for any $\epsilon > 0$ and $\delta > 0$ there exists an integer N such that for all n > N
we have $P(|X_n - X| < \epsilon) > 1 - \delta$. We write

$$
X_n \xrightarrow{P} X \tag{6.7}
$$

or $\text{plim}_{n \to \infty} X_n = X$. The last equality reads: the probability limit of $X_n$ is X.
(Alternatively, the clause may be paraphrased as $\lim P(|X_n - X| < \epsilon) = 1$ for
any $\epsilon > 0$.)

Intuitively, this convergence in probability is demonstrated in Table 6.1. As
the sample size expands, the quartile range declines and $|X_n - X|$ becomes
small.
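A simulation can illustrate the definition for the sample mean of standard normal draws, whose probability limit is zero. The sketch below estimates the probability that the sample mean lies within a tolerance of 0.1 of zero for several sample sizes. The tolerance, number of replications, seed, and function name are all assumptions.

```python
import random

def share_within(n, epsilon, reps, rng):
    """Fraction of replications in which |sample mean - 0| < epsilon."""
    hits = 0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(0, 1) for _ in range(n)) / n
        hits += abs(sample_mean) < epsilon
    return hits / reps

rng = random.Random(42)  # arbitrary seed
shares = [share_within(n, 0.1, 300, rng) for n in (25, 400, 1600)]
print(shares)
```

The estimated probability rises toward one as the sample size grows, which is the pattern Definition 6.3 requires.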

A slightly different mode of convergence is convergence in mean 
square. 


Definition 6.4 (Convergence in Mean Square). A sequence $\{X_n\}$ is said to
converge to X in mean square if $\lim_{n \to \infty} E(X_n - X)^{2} = 0$. We write

$$
X_n \xrightarrow{M} X. \tag{6.8}
$$


Table 6.2 presents the expected mean squared errors for the sample statistics 
for the standard normal distribution. The results indicate that the sample 
statistics converge in mean squared error to their theoretical values — that 
is, the mean squared error becomes progressively smaller as the sample size 
increases. 
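As a rough illustration of convergence in mean square, the sketch below estimates the mean squared error of the sample mean of standard normal draws, for which the theoretical value is 1/n. It is not an attempt to reproduce Table 6.2; the replication count, seed, and function name are assumptions.

```python
import random

def mse_of_sample_mean(n, reps, rng):
    """Average squared distance between the sample mean and the true mean (zero)."""
    total = 0.0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(0, 1) for _ in range(n)) / n
        total += sample_mean**2
    return total / reps

rng = random.Random(321)  # arbitrary seed
mse = {n: mse_of_sample_mean(n, 500, rng) for n in (50, 200, 800)}
for n, value in mse.items():
    # Second column is the simulated value, third is the theoretical 1/n
    print(n, round(value, 5), round(1 / n, 5))
```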

A final type of convergence, convergence in distribution, implies that
the whole distribution of $X_n$ approaches the distribution function of X.


Definition 6.5 (Convergence in Distribution). A sequence $\{X_n\}$ is said to
converge to X in distribution if the distribution function $F_n$ of $X_n$ converges
to the distribution function F of X at every continuity point of F. We
write

$$
X_n \xrightarrow{d} X \tag{6.9}
$$

and call F the limit distribution of $\{X_n\}$. If $\{X_n\}$ and $\{Y_n\}$ have the same
limit distribution, we write

$$
X_n \overset{LD}{=} Y_n. \tag{6.10}
$$

TABLE 6.2
Convergence in Mean Square for Standard Normal

                                     Sample Size
Statistic      50       100      200      400      800      1600     3200
Mean        0.03410  0.01148  0.00307  0.00073  0.00018  0.00005  0.00001
Variance    0.07681  0.01933  0.00535  0.00130  0.00029  0.00007  0.00002
Skewness    0.17569  0.06556  0.01614  0.00379  0.00092  0.00021  0.00006
Kurtosis    3.40879  0.98780  0.25063  0.06972  0.01580  0.00506  0.00110


Figures 5.2 and 5.3 demonstrate convergence in distribution. Figure 5.2
demonstrates that the normalized binomial converges to the standard normal, while
Figure 5.3 shows that the normalized sum of uniform random variables
converges in distribution to the standard normal.

The modes of convergence are related. Comparing Definition 6.3 with
Definition 6.4, $E(X_n - X)^{2}$ will tend to zero faster than
$|X_n - X|$. Basically, the squared convergence is faster than the linear
convergence implied by the absolute value. The result is Chebyshev's theorem.


Theorem 6.6 (Chebyshev).

$$
X_n \xrightarrow{M} X \Rightarrow X_n \xrightarrow{P} X. \tag{6.11}
$$


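Next, following Definition 6.5, assume that $f_n(x) \to f(x)$, or that the sample
distribution converges to a limiting distribution.

$$
\begin{aligned}
& \int_{-\infty}^{\infty} (z - \bar{z})^{2} f_n(z)\, dz - \int_{-\infty}^{\infty} (z - \bar{z})^{2} f(z)\, dz \\
& \quad = \int_{-\infty}^{\infty} (z - \bar{z})^{2} \left[f_n(z) - f(z)\right] dz = 0.
\end{aligned}
\tag{6.12}
$$

Therefore convergence in distribution implies convergence in mean square.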


Theorem 6.7.

$$
X_n \xrightarrow{P} X \Rightarrow X_n \xrightarrow{d} X. \tag{6.13}
$$
The convergence results give rise to Chebyshev's inequality,

$$
P\left[g(X_n) \ge \epsilon^{2}\right] \le \frac{E\left[g(X_n)\right]}{\epsilon^{2}} \tag{6.14}
$$


which can be used to make probabilistic statements about a variety of
statistics. For example, replacing $g(X_n)$ with the sample mean yields a form of the
confidence interval for the sample mean. Extending the probability statements
to general functions yields


Theorem 6.8. Let $X_n$ be a vector of random variables with a fixed finite
number of elements. Let g be a function continuous at a constant vector point
$\alpha$. Then

$$
X_n \xrightarrow{P} \alpha \Rightarrow g(X_n) \xrightarrow{P} g(\alpha). \tag{6.15}
$$


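Similar results are depicted in Slutsky's theorem.

Theorem 6.9 (Slutsky). If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} \alpha$, then

$$
\begin{aligned}
X_n + Y_n &\xrightarrow{d} X + \alpha \\
X_n Y_n &\xrightarrow{d} \alpha X \\
\frac{X_n}{Y_n} &\xrightarrow{d} \frac{X}{\alpha}, \quad \alpha \ne 0.
\end{aligned}
\tag{6.16}
$$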


6.2.1 Almost Sure Convergence 


Let $\omega$ represent the entire random sequence $\{Z_t\}$. As before, our interest
typically centers around the averages of this sequence.

$$
b_n(\omega) = \frac{1}{n}\sum_{t=1}^{n} Z_t. \tag{6.17}
$$

Definition 6.10. Let $\{b_n(\omega)\}$ be a sequence of real-valued random variables.
We say that $b_n(\omega)$ converges almost surely to b, written

$$
b_n(\omega) \xrightarrow{a.s.} b \tag{6.18}
$$

if and only if there exists a real number b such that

$$
P\left[\omega : b_n(\omega) \to b\right] = 1. \tag{6.19}
$$

The probability measure P describes the distribution of $\omega$ and determines
the joint distribution function for the entire sequence $\{Z_t\}$. Other common
terminology is that $b_n(\omega)$ converges to b with probability 1 or that $b_n(\omega)$ is
strongly consistent for b.
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Example 6.11. Let 


| 
Zn = — SoZ (6.20) 


where {Z;} is a sequence of independently and identically distributed (i.i.d.) 
random variables with E[Z;,] = u < co. Then 


Zn > (6.21) 
by Kolmogorov’s strong law of large numbers, Proposition 6.22 [8, p. 3]. 


Proposition 6.12. Given $g: R^k \to R^l$ ($k, l < \infty$) and any sequence $\{b_n\}$
such that
$$b_n \xrightarrow{a.s.} b \tag{6.22}$$
where $b_n$ and $b$ are $k \times 1$ vectors, if $g$ is continuous at $b$, then
$$g(b_n) \xrightarrow{a.s.} g(b). \tag{6.23}$$


This result then allows us to extend our results to include the matrix 
operations used to define the ordinary least squares estimators. 


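Theorem 6.13. Suppose
a) $y = X\beta_0 + \epsilon$;
b) $X'\epsilon/n \xrightarrow{a.s.} 0$;
c) $X'X/n \xrightarrow{a.s.} M$, where $M$ is finite and positive definite.

Then $\hat\beta_n$ exists a.s. for all $n$ sufficiently large, and
$\hat\beta_n \xrightarrow{a.s.} \beta_0$.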


Proof. Since $X'X/n \xrightarrow{a.s.} M$, it follows from Proposition 6.12 that
$\det(X'X/n) \xrightarrow{a.s.} \det(M)$. Because $M$ is positive definite by (c),
$\det(M) > 0$. It follows that $\det(X'X/n) > 0$ a.s. for all $n$ sufficiently
large, so $(X'X)^{-1}$ exists for all $n$ sufficiently large. Hence
$$\hat\beta_n = \left(\frac{X'X}{n}\right)^{-1}\frac{X'y}{n} \tag{6.24}$$
exists for all $n$ sufficiently large. In addition,
$$\hat\beta_n = \beta_0 + \left(\frac{X'X}{n}\right)^{-1}\frac{X'\epsilon}{n}. \tag{6.25}$$
It follows from Proposition 6.12 that
$$\hat\beta_n \xrightarrow{a.s.} \beta_0 + M^{-1}\cdot 0 = \beta_0. \tag{6.26}$$

The point of Theorem 6.13 is that the ordinary least squares estimator $\hat\beta_n$
converges almost surely to the true value of the parameters $\beta_0$. The concept is
very powerful. Using the small sample properties, we are sure that the ordinary
least squares estimator is unbiased. However, what if we cannot guarantee the
small sample properties? What if the error is not normal, or what if the values
of the independent variables are random? Given Theorem 6.13, we know that
the estimator still converges almost surely to the true value.
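
The claim can be explored numerically. The following is a minimal sketch, not part of the original text: it assumes a single random regressor drawn from a uniform distribution, centered exponential errors (mean zero but not normal), and hypothetical true values $\beta_0 = (1, 2)$. The function names are illustrative only.

```python
import random


def ols_simple(x, y):
    """Return (intercept, slope) from least squares of y on a constant and x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def simulate(n, beta0=1.0, beta1=2.0, seed=0):
    """Draw one sample of size n with a random regressor and skewed errors."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # Centered exponential errors: mean zero but clearly not normal.
    e = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
    y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, e)]
    return ols_simple(x, y)


# The estimates should settle near (1, 2) as the sample grows.
for n in (25, 400, 6400, 102400):
    b0, b1 = simulate(n)
    print(n, round(b0, 4), round(b1, 4))
```

Neither fixed regressors nor normal errors are used here, yet the estimates should still approach the assumed true values as $n$ grows, which is the behavior Theorem 6.13 describes.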


6.2.2 Convergence in Probability 
A weaker stochastic convergence concept is that of convergence in probability. 


Definition 6.14. Let $\{b_n(\omega)\}$ be a sequence of real-valued random variables.
If there exists a real number $b$ such that for every $\epsilon > 0$,
$$P\left[\omega : |b_n(\omega) - b| < \epsilon\right] \to 1 \tag{6.27}$$
as $n \to \infty$, then $b_n(\omega)$ converges in probability to $b$.


The almost sure measure of probability takes into account the joint distribution
of the entire sequence $\{Z_t\}$, but with convergence in probability,
we only need to be concerned with the joint distribution of those elements
that appear in $b_n(\omega)$. Convergence in probability is also referred to as weak
consistency.


Theorem 6.15. Let $\{b_n(\omega)\}$ be a sequence of random variables. If
$$b_n \xrightarrow{a.s.} b, \text{ then } b_n \xrightarrow{P} b. \tag{6.28}$$
If $b_n$ converges in probability to $b$, then there exists a subsequence
$\{b_{n_j}\}$ such that
$$b_{n_j} \xrightarrow{a.s.} b. \tag{6.29}$$


Given that we know Theorem 6.13, what is the point of Theorem 6.15?
Theorem 6.13 states that the ordinary least squares estimator converges almost
surely to the true value. Theorem 6.15 states that, given almost sure convergence
as in Theorem 6.13, $b_n$ also converges in probability to $b$. The point is that
convergence in probability can be guaranteed under fewer assumptions than almost
sure convergence. Making fewer assumptions is always preferred to making more
assumptions because the results are more robust.


6.2.3 Convergence in the rth Mean


Earlier in this chapter we developed the concept of mean squared convergence.
Generalizing the concept, we consider the expectation of the rth power of the
absolute deviation, yielding

Definition 6.16. Let $\{b_n(\omega)\}$ be a sequence of real-valued random variables.
If there exists a real number $b$ such that
$$E\left[|b_n(\omega) - b|^r\right] \to 0 \tag{6.30}$$
as $n \to \infty$ for some $r > 0$, then $b_n(\omega)$ converges in the rth mean to $b$,
written as
$$b_n(\omega) \xrightarrow{r.m.} b. \tag{6.31}$$


FIGURE 6.2
Expected Utility.


Next, we consider a proposition that has applications in both estimation
and economics: Jensen’s inequality. Following the development
of expected utility in Moss [32, pp. 57-82], economic decision makers choose
among alternatives to maximize their expected utility. As depicted in Figure 6.2,
the utility function for risk averse decision makers is concave in income.
Hence, the expected utility is less than the utility at the expected level
of income.
$$E[U(Y)] = pU(y_1) + (1 - p)U(y_2) < U\left[p \times y_1 + (1 - p) \times y_2\right] = U[E(Y)]. \tag{6.32}$$
Hence, risk averse decision makers are worse off under risk. Jensen’s inequality
provides a general statement of this concept.


Proposition 6.17 (Jensen’s inequality). Let $g : R^1 \to R^1$ be a convex
function on the interval $B \subset R^1$ and $Z$ be a random variable such that
$P[Z \in B] = 1$. Then $g[E(Z)] \leq E[g(Z)]$. If $g$ is concave on $B$, then
$g[E(Z)] \geq E[g(Z)]$.
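
A small numerical case may help fix the direction of the inequality. The sketch below is an illustration only: the two income levels, the equal probabilities, and the square-root utility function are all hypothetical choices, picked because the square root is concave.

```python
import math


def expected_value(outcomes, probs):
    """Probability-weighted average of the outcomes."""
    return sum(p * y for y, p in zip(outcomes, probs))


def expected_utility(outcomes, probs, utility):
    """Probability-weighted average of the utility of each outcome."""
    return sum(p * utility(y) for y, p in zip(outcomes, probs))


incomes = [100.0, 400.0]   # hypothetical y1 and y2
probs = [0.5, 0.5]         # hypothetical p and 1 - p
utility = math.sqrt        # concave, so the decision maker is risk averse

mean_income = expected_value(incomes, probs)      # E(Y)
eu = expected_utility(incomes, probs, utility)    # E[U(Y)]
u_of_mean = utility(mean_income)                  # U[E(Y)]

print(eu, u_of_mean)
```

With these assumed numbers the expected utility is 15 while the utility of the expected income is about 15.81, so $E[U(Y)] < U[E(Y)]$ as the concave case of the proposition requires.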

In addition, the development of the convergence of the rth mean allows
for the statement of a generalized version of Chebyshev’s inequality presented
in Equation 6.14.

Proposition 6.18 (Generalized Chebyshev Inequality). Let $Z$ be a random
variable such that $E|Z|^r < \infty$, $r > 0$. Then for every $\epsilon > 0$
$$P(|Z| \geq \epsilon) \leq \frac{E\left(|Z|^r\right)}{\epsilon^r}. \tag{6.33}$$


Another useful inequality that follows the rth mean result is Holder’s inequality.

Proposition 6.19 (Holder’s Inequality). If $p > 1$, $\frac{1}{p} + \frac{1}{q} = 1$, and if
$E|Y|^p < \infty$ and $E|Z|^q < \infty$, then
$$E|YZ| \leq \left(E|Y|^p\right)^{1/p}\left(E|Z|^q\right)^{1/q}. \tag{6.34}$$

If $p = q = 2$, we have the Cauchy-Schwarz inequality
$$E[|YZ|] \leq E\left[Y^2\right]^{1/2} E\left[Z^2\right]^{1/2}. \tag{6.35}$$


The rth mean convergence is also useful in demonstrating the ordering of
convergence presented in Theorem 6.6.


Theorem 6.20. If $b_n(\omega) \xrightarrow{r.m.} b$ for some $r > 0$, then
$b_n(\omega) \xrightarrow{P} b$.


The real point of Theorem 6.20 is that we will select the estimator that
minimizes the mean squared error (i.e., $r = 2$ in the rth mean).
Hence, Theorem 6.20 states that an estimator that converges in mean squared
error also converges in probability.
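
The link between convergence in the rth mean and convergence in probability can be written out in one line. The outline below is a sketch rather than a full proof; it assumes only the generalized Chebyshev inequality (Proposition 6.18) applied to $Z = b_n - b$.

```latex
P\left(|b_n - b| \geq \epsilon\right)
  \leq \frac{E\left[|b_n - b|^r\right]}{\epsilon^r} \to 0
  \quad \text{as } n \to \infty,
\qquad \text{so} \qquad
P\left(|b_n - b| < \epsilon\right) \to 1
  \quad \text{for every } \epsilon > 0 .
```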


6.3. Laws of Large Numbers 


Given the above convergence results, we can show that as the size of the
sample increases, the sample statistic converges to the underlying population
statistic. Taking our initial problem as an example,
$$\begin{aligned}
\lim_{n \to \infty} M_1(x_n) = \mu_1(x) &\Rightarrow \lim_{n \to \infty}\left[M_1(x_n) - \mu_1(x)\right] = 0 \\
\lim_{n \to \infty} M_2(x_n) = \mu_2(x) &\Rightarrow \lim_{n \to \infty}\left[M_2(x_n) - \mu_2(x)\right] = 0.
\end{aligned} \tag{6.36}$$

Proposition 6.21. Given restrictions on the dependence, heterogeneity, and
moments of a sequence of random variables $\{X_t\}$,
$$\bar{X}_n - \bar{\mu}_n \xrightarrow{a.s.} 0 \tag{6.37}$$
where
$$\bar{X}_n = \frac{1}{n}\sum_{t=1}^{n} X_t \text{ and } \bar{\mu}_n = E\left(\bar{X}_n\right). \tag{6.38}$$

Specifically, if we can assume that the random variables in the sample ($X_t$)
are independently and identically distributed, then

Theorem 6.22 (Kolmogorov). Let $\{X_t\}$ be a sequence of independently and
identically distributed (i.i.d.) random variables. Then
$$\bar{X}_n \xrightarrow{a.s.} \mu \tag{6.39}$$
if and only if $E|X_t| < \infty$ and $E(X_t) = \mu$.


Thus, the sample mean converges almost surely to the population mean.
In addition, letting $\{X_t\}$ be independent and identically distributed with
$E[X_t] = \mu$,
$$\bar{X}_n \xrightarrow{P} \mu. \tag{6.40}$$

This result is known as Khintchine’s law of large numbers [8, p. 2].
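
The law of large numbers is easy to watch in a simulation. The sketch below is illustrative only: it assumes i.i.d. draws from a uniform distribution on [0, 1), whose population mean is 0.5, and an arbitrary seed.

```python
import random


def running_means(draws):
    """Return the sample mean after each successive draw."""
    total = 0.0
    means = []
    for i, x in enumerate(draws, start=1):
        total += x
        means.append(total / i)
    return means


rng = random.Random(42)                      # arbitrary seed for repeatability
draws = [rng.random() for _ in range(100000)]
means = running_means(draws)

# Distance between the sample mean and the population mean of 0.5.
for n in (10, 100, 1000, 10000, 100000):
    print(n, round(means[n - 1], 4), round(abs(means[n - 1] - 0.5), 4))
```

The printed gap need not fall at every step, but it should be small for the largest sample sizes, in line with Equations 6.39 and 6.40.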


6.4 Asymptotic Normality 


Under the traditional assumptions of the linear model (fixed regressors and
normally distributed error terms), $\hat\beta_n$ is distributed multivariate normal with
$$\begin{aligned}
E\left[\hat\beta_n\right] &= \beta_0 \\
V\left[\hat\beta_n\right] &= \sigma^2 (X'X)^{-1}
\end{aligned} \tag{6.41}$$
for any sample size $n$. However, when the sample size becomes large the
distribution of $\hat\beta_n$ is approximately normal under some general conditions.


Definition 6.23. Let $\{b_n\}$ be a sequence of random finite-dimensional vectors
with joint distribution functions $\{F_n\}$. If $F_n(z) \to F(z)$ as $n \to \infty$ for every
continuity point $z$, where $F$ is the distribution function of a random variable
$Z$, then $b_n$ converges in distribution to the random variable $Z$, denoted
$$b_n \xrightarrow{d} Z. \tag{6.42}$$

Intuitively, the distribution of $b_n$ becomes closer and closer to the distribution
function of the random variable $Z$. Hence, the distribution $F$ can be used as
an approximation of the distribution function of $b_n$. Other ways of stating this
concept are that $b_n$ converges in law to $Z$,
$$b_n \xrightarrow{L} Z, \tag{6.43}$$
or that $b_n$ is asymptotically distributed as $F$,
$$b_n \overset{A}{\sim} F. \tag{6.44}$$
In this case, $F$ is called the limiting distribution of $b_n$.


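Example 6.24. Let $\{X_t\}$ be an i.i.d. sequence of random variables with mean
$\mu$ and variance $\sigma^2 < \infty$. Define
$$b_n = \frac{\bar{X}_n - E\left[\bar{X}_n\right]}{\left(V\left[\bar{X}_n\right]\right)^{1/2}}
= \frac{\sqrt{n}\left(\bar{X}_n - \mu\right)}{\sigma}
= \frac{1}{\sqrt{n}}\sum_{t=1}^{n}\frac{X_t - \mu}{\sigma}. \tag{6.45}$$
Then by the Lindeberg-Levy central limit theorem (Theorem 6.25),
$$b_n \overset{A}{\sim} N(0,1). \tag{6.46}$$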


Theorem 6.25 (Lindeberg-Levy). Let $\{X_t\}$ be i.i.d. with $E[X_t] = \mu$ and
$V(X_t) = \sigma^2$. Then, defining $Z_n = \sqrt{n}\left(\bar{X}_n - \mu\right)/\sigma$ as in
Equation 6.45, $Z_n \xrightarrow{d} N(0,1)$.
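
The theorem can be checked informally by simulation. The sketch below is an illustration under stated assumptions: the draws are uniform on [0, 1), so $\mu = 0.5$ and $\sigma^2 = 1/12$; the sample size of 30, the 20,000 replications, and the seed are arbitrary. It reports the share of standardized means falling within $\pm 1.96$, which would be 0.95 under an exact standard normal.

```python
import math
import random


def standardized_mean(sample, mu, sigma):
    """Compute sqrt(n) * (sample mean - mu) / sigma."""
    n = len(sample)
    return math.sqrt(n) * (sum(sample) / n - mu) / sigma


rng = random.Random(1)                 # arbitrary seed
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)  # moments of the uniform on [0, 1)
n, reps = 30, 20000

z = [standardized_mean([rng.random() for _ in range(n)], mu, sigma)
     for _ in range(reps)]

# Share of replications inside the central 95 percent region of N(0, 1).
share = sum(1 for v in z if abs(v) <= 1.96) / reps
print(round(share, 4))
```

The underlying draws are far from normal, yet the standardized mean should already behave much like a standard normal variate at this modest sample size.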


In this textbook, we will justify the Lindeberg-Levy theorem using a general
characteristic function for a sequence of random variables. Our demonstration
stops a little short of a formal proof but contains the essential points
necessary to justify the result. The characteristic function of a random variable
$X$ is defined as
$$\begin{aligned}
\phi_X(t) = E\left[e^{itX}\right] &= E\left[\cos(tX) + i\sin(tX)\right] \\
&= E[\cos(tX)] + iE[\sin(tX)].
\end{aligned} \tag{6.47}$$

This function may appear intimidating, but recalling our development of
the normal distribution in Section 3.7, we can rewrite the function
$f(x) = (x - 5)^2/5$ as
$$\phi(\theta) = r(\theta)\cos(\theta) \tag{6.48}$$
where the imaginary term is equal to zero. Hence, we start by defining the
characteristic function of a random variable in Definition 6.26.


Definition 6.26. Let $Z$ be a $k \times 1$ random vector with distribution function
$F$. The characteristic function of $F$ is defined as
$$f(\lambda) = E\left[\exp\left(i\lambda' Z\right)\right] \tag{6.49}$$
where $i^2 = -1$ and $\lambda$ is a $k \times 1$ real vector.

Notice the similarity between the definition of the moment generating function
and the characteristic function.
$$\begin{aligned}
M_X(t) &= E[\exp(tx)] \\
f(\lambda) &= E\left[\exp\left(i\lambda' z\right)\right].
\end{aligned} \tag{6.50}$$

In fact, a weaker form of the central limit theorem can be demonstrated using
the moment generating function instead of the characteristic function.

Next, we define the characteristic function for the normal distribution as

Definition 6.27. Let $Z \sim N\left(\mu, \sigma^2\right)$. Then
$$f(\lambda) = \exp\left(i\lambda\mu - \frac{\lambda^2\sigma^2}{2}\right). \tag{6.51}$$
For the standard normal ($\mu = 0$, $\sigma^2 = 1$) this reduces to $\exp\left(-\lambda^2/2\right)$.


It is important that the characteristic function is unique given the density
function for any random variable. Two random variables with the same
characteristic function also have the same density function.


Theorem 6.28 (Uniqueness Theorem). Two distribution functions are identical
if and only if their characteristic functions are identical.


This result also holds for moment generating functions. Hence, the point of
the Lindeberg-Levy proof is to demonstrate that the characteristic function
for $Z_n$ (the standardized mean) approaches the characteristic function of the
standard normal distribution as the sample size becomes large.


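Proof (Lindeberg-Levy). First define $f(\lambda)$ as the characteristic function for
$(Z_t - \mu)/\sigma$ and let $f_n(\lambda)$ be the characteristic function of
$$\frac{\sqrt{n}\left(\bar{Z}_n - \mu\right)}{\sigma}
= \left(\frac{1}{\sqrt{n}}\right)\sum_{t=1}^{n}\left(\frac{Z_t - \mu}{\sigma}\right). \tag{6.52}$$

By the structure of the characteristic function we have
$$\begin{aligned}
f_n(\lambda) &= \left[f\left(\frac{\lambda}{\sqrt{n}}\right)\right]^n \\
\ln\left(f_n(\lambda)\right) &= n\ln\left(f\left(\frac{\lambda}{\sqrt{n}}\right)\right).
\end{aligned} \tag{6.53}$$

Taking a second order Taylor series expansion of $f(\lambda)$ around $\lambda = 0$ gives
$$f(\lambda) = 1 - \frac{\lambda^2}{2} + o\left(\lambda^2\right) \tag{6.54}$$
so that
$$\ln\left(f_n(\lambda)\right) = n\ln\left[1 - \frac{\lambda^2}{2n} + o\left(\frac{\lambda^2}{n}\right)\right]
\to -\frac{\lambda^2}{2} \text{ as } n \to \infty. \tag{6.55}$$

Another proof of the central limit theorem involves taking a Taylor series
expansion of the characteristic function around the point $t = 0$, yielding
$$\phi_Z(t) = \phi_Z(0) + \phi_Z'(0)\,t + \frac{1}{2}\phi_Z''(0)\,t^2 + o\left(t^2\right)
\quad \text{s.t. } Z = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}. \tag{6.56}$$
To work on this expression we note that
$$\phi_X(0) = 1 \tag{6.57}$$
for any random variable $X$, and
$$\phi_X^{(k)}(0) = i^k E\left(X^k\right). \tag{6.58}$$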


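Putting these two results into the second-order Taylor series expansion,
$$\phi_Z(t) = 1 + iE(Z)\,t + \frac{i^2 E\left(Z^2\right)}{2}t^2 + o\left(t^2\right)
= 1 - \frac{t^2}{2} + o\left(t^2\right) \tag{6.59}$$
because $E(Z) = 0$ and $E\left(Z^2\right) = 1$.
Thus,
$$\begin{aligned}
\phi_Z(t) &= \phi_Z(0) + \phi_Z'(0)\,t + \frac{1}{2}\phi_Z''(0)\,t^2 + o\left(t^2\right) \\
&= 1 - \frac{1}{2}t^2 + o\left(t^2\right),
\end{aligned} \tag{6.60}$$
which matches the expansion of $\exp\left(-t^2/2\right)$, the characteristic function of
$N(0,1)$, to second order. Applying the expansion to the standardized sum as in
Equation 6.55 gives $Z \sim N(0,1)$ in the limit,
or $Z$ is normally distributed. This approach can be used to prove the central
limit theorem from the moment generating function.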

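For completeness, there are other characteristic functions that the limit
does not approach. The characteristic function of the uniform distribution
function on $[0, 1]$ is
$$\phi_X(t) = \frac{e^{it} - 1}{it}. \tag{6.61}$$
The gamma distribution’s characteristic function, for shape $\alpha$ and scale $\beta$
(mean $\alpha\beta$), is
$$\phi_X(t) = \frac{1}{(1 - i\beta t)^{\alpha}}. \tag{6.62}$$
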

Thus, our development of the Lindeberg-Levy theorem does not assume the
result. The fact that the sample means converge to the true population mean
and the variance of the means converges to a fixed value drives the result.
Essentially, only the first two moments matter asymptotically.


6.5 Wrapping Up Loose Ends


Finally, we can use some of the implications of convergence to infer results 
about the correlation coefficient between any two random variables, bound 
the general probabilities, and examine the convergence of the binomial distri- 
bution to the normal. 


6.5.1 Application of Holder’s Inequality 


Using Holder’s inequality it is possible to place a general bound on the corre- 
lation coefficient regardless of the distribution. 


Example 6.29. If $X$ and $Y$ have means $\mu_x$, $\mu_y$ and variances $\sigma_x^2$, $\sigma_y^2$,
respectively, we can apply the Cauchy-Schwarz inequality (Holder’s inequality
with $p = q = 2$) to get
$$E\left|(X - \mu_x)(Y - \mu_y)\right| \leq \left\{E\left[(X - \mu_x)^2\right]\right\}^{1/2}\left\{E\left[(Y - \mu_y)^2\right]\right\}^{1/2}. \tag{6.63}$$
Squaring both sides and substituting for variances and covariances yields
$$\left(Cov(X, Y)\right)^2 \leq \sigma_x^2\sigma_y^2 \tag{6.64}$$
which implies that the absolute value of the correlation coefficient is at most
one.


6.5.2 Application of Chebychev’s Inequality 
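Using Chebychev’s inequality we can bound the probability of an outcome of
a random variable ($x$) being different from the mean for any distribution.


Example 6.30. The most widespread use of Chebychev’s inequality involves
means and variances. Let $g(x) = (x - \mu)^2/\sigma^2$, where $\mu = E[X]$ and
$\sigma^2 = V(X)$, and set the threshold in Equation 6.14 to $t^2$. Then
$$P\left(\frac{(X - \mu)^2}{\sigma^2} \geq t^2\right) \leq \frac{1}{t^2}E\left[\frac{(X - \mu)^2}{\sigma^2}\right] = \frac{1}{t^2}. \tag{6.65}$$
Since $(X - \mu)^2/\sigma^2 \geq t^2$ is equivalent to $|X - \mu| \geq t\sigma$,
$$P\left(|X - \mu| \geq t\sigma\right) \leq \frac{1}{t^2}. \tag{6.66}$$
For example, with $t = 2$,
$$P\left(|X - \mu| \geq 2\sigma\right) \leq \frac{1}{2^2} = 0.25. \tag{6.67}$$

However, this inequality may not say much, since for the normal distribution
$$P\left(|X - \mu| \geq 2\sigma\right) = 2\int_{2}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{z^2}{2}\right]dz
= 2 \times (0.02275) = 0.0455. \tag{6.68}$$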


Thus, the actual probability under the standard normal distribution is much
smaller than the Chebychev bound. Put slightly differently, the Chebychev
method gives a very loose probability bound.
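
How loose the bound is can be tabulated for several values of $t$. The sketch below is illustrative and assumes the comparison distribution is the standard normal, whose two-sided tail probability $P(|Z| \geq t)$ equals `erfc(t / sqrt(2))`; the function names are hypothetical.

```python
import math


def chebyshev_bound(t):
    """Upper bound on P(|X - mu| >= t * sigma) valid for any distribution."""
    return 1.0 / t ** 2


def normal_two_sided_tail(t):
    """Exact P(|Z| >= t) for a standard normal Z."""
    return math.erfc(t / math.sqrt(2.0))


for t in (1.5, 2.0, 3.0, 4.0):
    print(t, round(chebyshev_bound(t), 4), round(normal_two_sided_tail(t), 4))
```

At $t = 2$ the two columns reproduce the 0.25 and 0.0455 of Example 6.30, and the ratio between the bound and the normal probability widens as $t$ increases.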


6.5.3. Normal Approximation of the Binomial 


Starting from the binomial distribution function
$$b(n, r, p) = \frac{n!}{(n - r)!\,r!}\,p^r (1 - p)^{n - r} \tag{6.69}$$
first assume that $n = 10$ and $p = 0.5$. The probability of $r \leq 3$ is
$$P(r \leq 3) = b(10, 0, 0.5) + b(10, 1, 0.5) + b(10, 2, 0.5) + b(10, 3, 0.5) = 0.1719. \tag{6.70}$$


Note that this distribution has a mean of 5 and a variance of 2.5. Given this
we can compute
$$z^* = \frac{3 - 5}{\sqrt{2.5}} = -1.265. \tag{6.71}$$
Integrating the standard normal distribution function from negative infinity
to $-1.265$ yields
$$P\left(z^* \leq -1.265\right) = \int_{-\infty}^{-1.265}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{z^2}{2}\right]dz = 0.1030. \tag{6.72}$$
Expanding the sample size to 20 and examining the probability that $r \leq 6$
yields
$$P(r \leq 6) = \sum_{i=0}^{6} b(20, i, 0.5) = 0.0577. \tag{6.73}$$
This time the mean of the distribution is 10 and the variance is 5. The resulting
$z^* = -1.7889$. The integral of the normal distribution function from negative
infinity to $-1.7889$ is 0.0368.

As the sample size increases, the binomial probability approaches the normal
probability: the gap falls from 0.0689 at $n = 10$ to 0.0209 at $n = 20$. Hence,
the binomial converges in distribution to the normal
distribution, as depicted in Figure 5.2.
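
The two comparisons above can be reproduced, and extended to larger $n$, with a few lines of code. This is a sketch for illustration: the function names are hypothetical, and the normal approximation is applied exactly as in the text, without a continuity correction.

```python
import math


def binom_cdf(r, n, p):
    """P(R <= r) for a binomial with n trials and success probability p."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(r + 1))


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def normal_approx(r, n, p):
    """Normal probability at z* = (r - np) / sqrt(np(1 - p))."""
    z_star = (r - n * p) / math.sqrt(n * p * (1 - p))
    return normal_cdf(z_star)


# The first two rows correspond to Equations 6.70-6.73; the third is an
# additional case chosen for illustration.
for r, n in ((3, 10), (6, 20), (40, 100)):
    exact = binom_cdf(r, n, 0.5)
    approx = normal_approx(r, n, 0.5)
    print(n, round(exact, 4), round(approx, 4), round(exact - approx, 4))
```

Adding a continuity correction (evaluating the normal at $r + 0.5$) would narrow the gap at small $n$; it is left out here to match the calculation in the text.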


6.6 Chapter Summary 


• Small sample assumptions, such as the assumption that the set of
independent variables is fixed by the choice of the experimenter and the
normality of residuals, yield very powerful identification conditions for
estimation.


• Econometricians very seldom are able to make these assumptions. Hence,
we are interested in the limiting behavior of estimators as the sample size
becomes very large.


• Of the modes of convergence, we are typically most interested in convergence
in distribution. Several estimators of interest, such as ordinary least
squares and maximum likelihood estimators, converge in distribution to the
normal distribution.


• Jensen’s inequality has implications for economic concepts such as
expected utility.


6.7 Review Questions 


6-1R. A couple of production functions are known for their numerical limits.
For example, the Spillman production function
$$f(x_1) = \alpha_0\left(1 - \exp(\alpha_1 - \alpha_2 x_1)\right) \tag{6.74}$$
and its generalization called the Mitscherlich-Baule
$$f(x_1, x_2) = \beta_0\left(1 - \exp(\beta_1 - \beta_2 x_1)\right)\left(1 - \exp(\beta_3 - \beta_4 x_2)\right) \tag{6.75}$$
have limiting levels. Demonstrate the limit of each of these functions
as $x_1, x_2 \to \infty$.


6-2R. Following the discussion of Sandmo [40], demonstrate that the expected 
value of a concave production function lies below the production func- 
tion at its expected value. 


6.8 Numerical Exercises 


6-1E. Construct 10 samples of a Bernoulli distribution for 25, 50, 75, and
100 draws. Compute the mean and standard deviation for each sample.
Demonstrate that the mean and variance of the samples converge
numerically.


6-2E. Compare the probability that fewer than 10 heads will be tossed out of 
50 with the comparable event under a normal distribution. 


7

We will divide the discussion into the estimation of a single number such as a
mean or standard deviation, or the estimation of a range such as a confidence 
interval. At the most basic level, the definition of an estimator involves the 
distinction between a sample and a population. In general, we assume that we 
have a random variable (X) with some distribution function. Next, we assume 
that we want to estimate something about that population, for example, we 
may be interested in estimating the mean of the population or probability 
that the outcome will lie between two numbers. In a farm-planning model, 
we may be interested in estimating the expected return for a particular crop. 
In a regression context, we may be interested in estimating the average effect 
of price or income on the quantity of goods consumed. This estimation is
typically based on a sample of outcomes drawn from the population instead
of the population itself. 


7.1 Sampling and Sample Image 


Focusing on the sample versus population dichotomy for a moment, the sample
image of $X$, denoted $X^*$, and the empirical distribution function for $f(X)$ can
be depicted as a discrete distribution function with probability $1/n$.
Consider an example from production economics. Suppose that we observe
data on the level of production for a group of firms and their inputs (e.g., the
capital ($K$), labor ($L$), energy ($E$), and material ($M$) data from Dale
Jorgenson’s KLEM dataset [22] for a group of industries $i = 1, \cdots, N$). Next, assume
that we are interested in measuring the inefficiency given an estimate of the
efficient amount of production associated with each input ($\hat{y}_i(k_i, l_i, e_i, m_i)$).
$$\epsilon_i = y_i - \hat{y}_i(k_i, l_i, e_i, m_i). \tag{7.1}$$
For the moment assume that the efficient level of production is known without
error. One possible assumption is that $-\epsilon_i \sim \Gamma(\alpha, \beta)$, or all firms are at most
efficient ($y_i - \hat{y}_i(k_i, l_i, e_i, m_i) \leq 0$). An example of the gamma distribution is
presented in Figure 7.1.

Given this specification, we could be interested in estimating the
characteristics of the inefficiency for a firm in a specific industry, say the average
technical inefficiency of firms in the Food and Fiber Sector. Table 7.1 presents
one such sample for 50 firms in ascending order (i.e., this is not the order
the sample was drawn in). In this table we define the empirical cumulative
distribution as
$$\hat{F}(\epsilon_i) = \frac{i}{N} \tag{7.2}$$
where $N = 50$ (the number of observations). The next column gives the
theoretical cumulative distribution function ($F(\epsilon_i)$), obtained by integrating the
gamma density function from 0 to $\epsilon_i$. The relationship between the empirical and
theoretical cumulative distribution functions is presented in Figure 7.2. From this
graphical depiction, we conclude that the sample image (i.e., the empirical
cumulative distribution) approaches the theoretical distribution. Given the data
presented in Table 7.1, the sample mean is 2.03 and the sample variance is 1.27.


FIGURE 7.1
Density Function for a Gamma Distribution.


TABLE 7.1
Small Sample of Gamma Random Variates

Obs.   ε        F̂(ε)   F(ε)      Obs.   ε        F̂(ε)   F(ε)
  1    0.4704   0.02   0.0156     26    1.7103   0.52   0.4461
  2    0.4717   0.04   0.0157     27    1.7424   0.54   0.4601
  3    0.5493   0.06   0.0256     28    1.8291   0.56   0.4971
  4    0.6324   0.08   0.0397     29    1.9420   0.58   0.5436
  5    0.6978   0.10   0.0532     30    1.9559   0.60   0.5491
  6    0.7579   0.12   0.0676     31    1.9640   0.62   0.5524
  7    0.9646   0.14   0.1303     32    2.1041   0.64   0.6061
  8    0.9849   0.16   0.1375     33    2.2862   0.66   0.6698
  9    0.9998   0.18   0.1428     34    2.3390   0.68   0.6868
 10    1.0667   0.20   0.1677     35    2.3564   0.70   0.6923
 11    1.0927   0.22   0.1778     36    2.5629   0.72   0.7522
 12    1.1193   0.24   0.1883     37    2.6581   0.74   0.7766
 13    1.1895   0.26   0.2169     38    2.8669   0.76   0.8234
 14    1.2258   0.28   0.2321     39    2.9415   0.78   0.8381
 15    1.3933   0.30   0.3051     40    3.0448   0.80   0.8566
 16    1.4133   0.32   0.3140     41    3.0500   0.82   0.8575
 17    1.4354   0.34   0.3238     42    3.0869   0.84   0.8637
 18    1.5034   0.36   0.3543     43    3.1295   0.86   0.8705
 19    1.5074   0.38   0.3561     44    3.1841   0.88   0.8788
 20    1.5074   0.40   0.3561     45    4.0159   0.90   0.9585
 21    1.5459   0.42   0.3733     46    4.1773   0.92   0.9667
 22    1.5639   0.44   0.3814     47    4.2499   0.94   0.9699
 23    1.5823   0.46   0.3896     48    4.4428   0.96   0.9770
 24    1.5827   0.48   0.3898     49    4.4562   0.98   0.9774
 25    1.6533   0.50   0.4211     50    4.6468   1.00   0.9828


FIGURE 7.2
Empirical versus Theoretical Cumulative Distribution Functions: Small Sample.


FIGURE 7.3
Empirical versus Theoretical Cumulative Distribution Functions: Large Sample.


FIGURE 7.4
Probability and Cumulative Beta Distributions.


Next, we extend the sample to $N = 200$ observations. The empirical
and theoretical cumulative distribution functions for this sample are presented in
Figure 7.3. Intuitively, the sample image for the larger sample is closer to the
underlying distribution function than the smaller sample. Empirically, the
mean of the larger sample is 2.03 and the variance is 1.02. Given that the true
underlying distribution is a $\Gamma(\alpha = 4, \beta = 2)$ (with $\beta$ entering as a rate, so
that the mean is $\alpha/\beta$ and the variance is $\alpha/\beta^2$), the theoretical mean is 2 and
the variance is 1. Hence, while there is little improvement in the estimate of
the mean from the larger sample, the estimate of the variance for the larger
sample is much closer to its true value.
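
The comparison between the empirical and theoretical cumulative distributions can be repeated for any sample size. The sketch below is an illustration, not the procedure used to build Table 7.1: it assumes a gamma with shape 4 and rate 2 (mean 2, variance 1), uses the closed-form cumulative distribution available for an integer shape, and draws with Python's `random.gammavariate`, whose second argument is a scale (here 1/2). The seed and function names are arbitrary.

```python
import math
import random


def gamma_cdf_int_shape(x, shape, rate):
    """CDF of a gamma with integer shape: 1 - exp(-y) * sum_{k<shape} y^k / k!."""
    y = rate * x
    partial = sum(y ** k / math.factorial(k) for k in range(shape))
    return 1.0 - math.exp(-y) * partial


def max_ecdf_gap(sample, cdf):
    """Largest gap between the empirical CDF i/N and the theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    return max(abs((i + 1) / n - cdf(x)) for i, x in enumerate(xs))


def true_cdf(x):
    return gamma_cdf_int_shape(x, 4, 2.0)


rng = random.Random(7)  # arbitrary seed
for n in (50, 200, 5000):
    sample = [rng.gammavariate(4, 0.5) for _ in range(n)]
    print(n, round(max_ecdf_gap(sample, true_cdf), 4))
```

The largest gap between the two curves should tend to shrink as the sample grows, which is the sense in which the sample image approaches the theoretical distribution.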

To develop the concept of sampling from a distribution, assume that we are 
interested in estimating the share of a household’s income spent on housing. 
One possibility for this effort is the beta distribution, which is a two parameter 
distribution for a continuous random variable with values between zero and 
one (depicted in Figure 7.4). Assume that our population is the set of 40 
faculty of some academic department. Further assume that the true underlying 
beta distribution is the one depicted in Table 7.2. Assume that it is too costly 
to sample all 40 faculty members for some reason and that we will only be able 
to collect a sample of 8 faculty (i.e., there are two days remaining in the spring 
semester so the best you can hope for is to contact 8 faculty). The question 
is how does our sample of eight faculty relate to the true beta distribution? 

First, assume that we rank the faculty by the percent of their income 
spent on housing from the lowest to the highest. Next, assume that we draw 
a sample of eight faculty members from this list (or sample) at random. 


s = {34, 27, 19, 29, 33, 12, 23, 35}. (7.3) 


TABLE 7.2
Density and Cumulative Density Functions for Beta Distribution

x       f(x|α,β)   F(x|α,β)     x       f(x|α,β)   F(x|α,β)
0.025   0.2852     0.0036       0.525   1.4214     0.7240
0.050   0.5415     0.0140       0.550   1.3365     0.7585
0.075   0.7701     0.0305       0.575   1.2463     0.7908
0.100   0.9720     0.0523       0.600   1.1520     0.8208
0.125   1.1484     0.0789       0.625   1.0547     0.8484
0.150   1.3005     0.1095       0.650   0.9555     0.8735
0.175   1.4293     0.1437       0.675   0.8556     0.8962
0.200   1.5360     0.1808       0.700   0.7560     0.9163
0.225   1.6217     0.2203       0.725   0.6579     0.9340
0.250   1.6875     0.2617       0.750   0.5625     0.9492
0.275   1.7346     0.3045       0.775   0.4708     0.9621
0.300   1.7640     0.3483       0.800   0.3840     0.9728
0.325   1.7769     0.3926       0.825   0.3032     0.9814
0.350   1.7745     0.4370       0.850   0.2295     0.9880
0.375   1.7578     0.4812       0.875   0.1641     0.9929
0.400   1.7280     0.5248       0.900   0.1080     0.9963
0.425   1.6862     0.5675       0.925   0.0624     0.9984
0.450   1.6335     0.6090       0.950   0.0285     0.9995
0.475   1.5711     0.6491       0.975   0.0073     0.9999
0.500   1.5000     0.6875       1.000   0.0000     1.0000


Taking the first point, 34/40 is equivalent to a uniform outcome of 0.850.
Graphically, we can map this draw from a uniform random outcome into a
beta outcome, as depicted in Figure 7.5, yielding a value of the beta random
variable of 0.6266. This value requires a linear interpolation. The uniform
value (i.e., the value of the cumulative distribution for beta) lies between
0.8484 ($x = 0.625$) and 0.8735 ($x = 0.650$).
$$x = 0.625 + (0.8500 - 0.8484) \times \frac{0.650 - 0.625}{0.8735 - 0.8484} = 0.6266. \tag{7.4}$$
Thus, if our distribution is true ($B(\alpha = 3, \beta = 2)$), the 34th individual in the
sample will spend 0.6266 of their income on housing. The sample of housing
shares for these individuals is then
$$t = \{0.6266, 0.4919, 0.3715, 0.5257, 0.6038, 0.2724, 0.4295, 0.6516\}. \tag{7.5}$$
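
The mapping from a uniform outcome to a beta outcome can also be done by inverting the cumulative distribution numerically rather than interpolating in a table. The sketch below is illustrative and rests on one assumption that should be checked against the text: it uses $F(x) = 6x^2 - 8x^3 + 3x^4$, the cumulative distribution of the density $12x(1 - x)^2$, because that function reproduces the tabulated values in Table 7.2 (for example $F(0.5) = 0.6875$). The bisection routine and its names are hypothetical.

```python
def beta_cdf(x):
    """Assumed CDF matching Table 7.2: integral of 12 x (1 - x)^2 from 0 to x."""
    return 6 * x ** 2 - 8 * x ** 3 + 3 * x ** 4


def inverse_cdf(u, cdf, lo=0.0, hi=1.0, tol=1e-10):
    """Solve cdf(x) = u on [lo, hi] by bisection (cdf must be increasing)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


ranks = [34, 27, 19, 29, 33, 12, 23, 35]      # the draws in Equation 7.3
uniforms = [r / 40 for r in ranks]            # rank / population size
shares = [inverse_cdf(u, beta_cdf) for u in uniforms]

for r, u, s in zip(ranks, uniforms, shares):
    print(r, round(u, 3), round(s, 4))
```

Because the text interpolates linearly between table rows, the values here can differ from Equation 7.5 in the fourth decimal place.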


Table 7.3 presents a larger sample of random variables drawn according to 
the theoretical distribution. Figure 7.6 presents the sample and theoretical 
cumulative density functions for the data presented in Table 7.3. 

The point of the discussion is that a sample drawn at random from a
population that obeys any specific distribution function will replicate that
distribution function (the empirical distribution of the sample converges to the
theoretical distribution). The uniform distribution is simply the collection of all
individuals in the population. We assume that each individual is equally likely to be
drawn for the sample. We order the underlying uniform distribution in our
discussion as a matter of convenience. However, given that we draw the sample
population randomly, no assumption about knowing the underlying ordering
of the population is actually used.


FIGURE 7.5
Inverse Beta Distribution.


TABLE 7.3
Random Sample of Betas

Obs.   U[0,1]   B(α,β)     Obs.   U[0,1]   B(α,β)
  1    0.3900   0.3235      21    0.3944   0.3260
  2    0.8403   0.6177      22    0.0503   0.0977
  3    0.3312   0.2902      23    0.5190   0.3967
  4    0.5652   0.4236      24    0.4487   0.3566
  5    0.7302   0.5295      25    0.7912   0.5753
  6    0.4944   0.3826      26    0.4874   0.3785
  7    0.3041   0.2748      27    0.7320   0.5307
  8    0.3884   0.3227      28    0.4588   0.3623
  9    0.2189   0.2241      29    0.1510   0.1799
 10    0.9842   0.8357      30    0.9094   0.6915
 11    0.8840   0.6616      31    0.6834   0.4973
 12    0.0244   0.0657      32    0.6400   0.4694
 13    0.0354   0.0806      33    0.6833   0.4973
 14    0.0381   0.0837      34    0.3476   0.2996
 15    0.8324   0.6105      35    0.3600   0.3066
 16    0.0853   0.1302      36    0.0993   0.1417
 17    0.5128   0.3931      37    0.5149   0.3943
 18    0.7460   0.5409      38    0.7397   0.5364
 19    0.4754   0.3717      39    0.0593   0.1066
 20    0.0630   0.1101      40    0.4849   0.3771


FIGURE 7.6
Sample and Theoretical Beta Distributions.


7.2. Familiar Estimators 


As a starting point, let us consider a variety of estimators that students have
seen in introductory statistics courses. For example, we start by considering
the sample mean
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i. \tag{7.6}$$

Based on this accepted definition, we ask the question: what do we know
about the properties of the mean? Using Theorem 4.23, we know that
$$E\left[\bar{X}\right] = E[X] \tag{7.7}$$
which means that the population mean is close to a “center” of the distribution
of the sample mean. Suppose $V(X) = \sigma^2$ is finite. Then, using Theorem 4.17,
we know that
$$V\left(\bar{X}\right) = \frac{\sigma^2}{n} \tag{7.8}$$
which shows that the degree of dispersion of the distribution of the sample
mean around the population mean is inversely related to the sample size $n$.
Using Khintchine’s law of large numbers (Equation 6.40), we know that
$$\operatorname*{plim}_{n \to \infty} \bar{X} = E[X]. \tag{7.9}$$
If $V(X)$ is finite, the same result follows from Equations 7.7 and 7.8 above
because of Chebychev’s inequality.
Other familiar statistics include the sample variance.
$$S_X^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \left(\bar{X}\right)^2. \tag{7.10}$$
Another familiar statistic is the kth sample moment around zero
$$M_k = \frac{1}{n}\sum_{i=1}^{n} X_i^k \tag{7.11}$$
and the kth moment around the mean
$$\tilde{M}_k = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^k. \tag{7.12}$$
As discussed in a previous example, the kth moment around the mean is used
to draw conclusions regarding the skewness and kurtosis of the sample. In
addition, most students have been introduced to the sample covariance
$$Cov(X, Y) = S_{xy} = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right) \tag{7.13}$$
and the sample correlation
$$\rho_{xy} = \frac{S_{xy}}{S_x S_y}. \tag{7.14}$$
In each case the student is typically introduced to an intuitive meaning of 
each statistic. For example, the mean is related to the expected value of 
a random variable, the variance provides a measure of the dispersion of 
the random variables, and the covariance provides a measure of the ten- 
dency of two random variables to vary together (either directly or indirectly). 
This chapter attempts to link estimators with parameters of underlying 
distributions. 
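
The estimators in Equations 7.6 and 7.10 through 7.14 translate directly into code. The sketch below is a minimal illustration using the $1/n$ divisor that appears in those equations; the five-observation data set and the function names are hypothetical.

```python
import math


def mean(xs):
    """Sample mean, Equation 7.6."""
    return sum(xs) / len(xs)


def raw_moment(xs, k):
    """kth sample moment around zero, Equation 7.11."""
    return sum(x ** k for x in xs) / len(xs)


def central_moment(xs, k):
    """kth sample moment around the mean, Equation 7.12 (k = 2 gives 7.10)."""
    m = mean(xs)
    return sum((x - m) ** k for x in xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance, Equation 7.13."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Sample correlation, Equation 7.14."""
    sx = math.sqrt(central_moment(xs, 2))
    sy = math.sqrt(central_moment(ys, 2))
    return covariance(xs, ys) / (sx * sy)


x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
y = [2.0, 4.0, 5.0, 4.0, 5.0]   # hypothetical data

print(mean(x), central_moment(x, 2), covariance(x, y), correlation(x, y))
```

For these data the variance computed either way in Equation 7.10 is 2, and the correlation of about 0.77 lies inside the bound of one derived in Section 6.5.1.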


7.2.1 Estimators in General 


In general, an estimator is a function of the sample, not based on population
parameters. First, the estimator is a known function of random variables.
$$\hat\theta = \phi(X_1, X_2, X_3, \cdots, X_n). \tag{7.15}$$

The value of an estimator is then a random variable. As with any other random
variable, it is possible to define the distribution of the estimator based on the
distribution of the random variables in the sample. These distributions will
be used in the next chapter to define confidence intervals. Any function of the
sample is referred to as a statistic. Most of the time in econometrics, we focus
on the moments as sample statistics. Specifically, we may be interested in the
sample means, or may use the sample covariances with the sample variances
to define the least squares estimator.

Other statistics may be important. For example, we may be interested in
the probability of a given die roll (for example, the probability of a three).
If we define a new set of variables, $Y$, such that $Y = 1$ if $X = 3$ and $Y = 0$
otherwise, the probability of a three becomes
$$\hat\theta_3 = \frac{1}{n}\sum_{i=1}^{n} Y_i. \tag{7.16}$$


Amemiya [1, p. 115] suggests that this probability could also be estimated
from the moments of the distribution. Assume that you have a sample of 50
die rolls. Compute the sample moment of each order $k = 0, 1, 2, 3, 4, 5$

$$\hat{m}_k = \frac{1}{50}\sum_{i=1}^{50} X_i^k \tag{7.17}$$

where $X_i$ is the value of the $i$th die roll. The method of moments estimate of
each probability $\theta_j$ is defined by the solution of the six-equation system

$$\begin{aligned}
M^{(0)}(\theta) &= \sum_{j=1}^{6} \theta_j \\
M^{(1)}(\theta) &= \sum_{j=1}^{6} j\,\theta_j \\
M^{(2)}(\theta) &= \sum_{j=1}^{6} j^2\,\theta_j \\
M^{(3)}(\theta) &= \sum_{j=1}^{6} j^3\,\theta_j \\
M^{(4)}(\theta) &= \sum_{j=1}^{6} j^4\,\theta_j \\
M^{(5)}(\theta) &= \sum_{j=1}^{6} j^5\,\theta_j
\end{aligned} \tag{7.18}$$

TABLE 7.4
Sample of Die Rolls

Observation  Outcome   Observation  Outcome   Observation  Outcome
     1          1          18          3          35          6
     2          5          19          5          36          4
     3          3          20          5          37          6
     4          2          21          3          38          2
     5          6          22          2          39          1
     6          5          23          1          40          3
     7          4          24          5          41          3
     8          2          25          1          42          5
     9          1          26          2          43          4
    10          3          27          1          44          2
    11          1          28          6          45          1
    12          5          29          2          46          1
    13          6          30          1          47          3
    14          1          31          6          48          3
    15          6          32          2          49          3
    16          5          33          2          50          3
    17          1          34          2


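where $\theta = (\theta_1, \theta_2, \theta_3, \theta_4, \theta_5, \theta_6)$ are the
probabilities of rolling a $1, 2, \cdots, 6$, respectively. Consider the sample of
50 observations presented in Table 7.4. We can solve for the method of moments
estimator for the parameters in $\theta$ by solving

$$\begin{aligned}
M^{(0)}(\theta) &= \sum_{j=1}^{6} \theta_j = \hat{m}_0 = \frac{1}{50}\sum_{i=1}^{50} x_i^0 = 1 \\
M^{(1)}(\theta) &= \sum_{j=1}^{6} j\,\theta_j = \hat{m}_1 = \frac{1}{50}\sum_{i=1}^{50} x_i = 3.12 \\
M^{(2)}(\theta) &= \sum_{j=1}^{6} j^2\,\theta_j = \hat{m}_2 = \frac{1}{50}\sum_{i=1}^{50} x_i^2 = 12.84 \\
M^{(3)}(\theta) &= \sum_{j=1}^{6} j^3\,\theta_j = \hat{m}_3 = \frac{1}{50}\sum_{i=1}^{50} x_i^3 = 61.32 \\
M^{(4)}(\theta) &= \sum_{j=1}^{6} j^4\,\theta_j = \hat{m}_4 = \frac{1}{50}\sum_{i=1}^{50} x_i^4 = 316.44 \\
M^{(5)}(\theta) &= \sum_{j=1}^{6} j^5\,\theta_j = \hat{m}_5 = \frac{1}{50}\sum_{i=1}^{50} x_i^5 = 1705.32.
\end{aligned} \tag{7.19}$$

The method of moments estimator can then be written as

$$\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 \\
1 & 2 & 3 & 4 & 5 & 6 \\
1 & 4 & 9 & 16 & 25 & 36 \\
1 & 8 & 27 & 64 & 125 & 216 \\
1 & 16 & 81 & 256 & 625 & 1296 \\
1 & 32 & 243 & 1024 & 3125 & 7776
\end{bmatrix}
\begin{bmatrix} \hat{\theta}_1 \\ \hat{\theta}_2 \\ \hat{\theta}_3 \\ \hat{\theta}_4 \\ \hat{\theta}_5 \\ \hat{\theta}_6 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 3.12 \\ 12.84 \\ 61.32 \\ 316.44 \\ 1705.32 \end{bmatrix}
\Rightarrow
\begin{bmatrix} \hat{\theta}_1 \\ \hat{\theta}_2 \\ \hat{\theta}_3 \\ \hat{\theta}_4 \\ \hat{\theta}_5 \\ \hat{\theta}_6 \end{bmatrix}
=
\begin{bmatrix} 0.24 \\ 0.20 \\ 0.20 \\ 0.06 \\ 0.16 \\ 0.14 \end{bmatrix}. \tag{7.20}$$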


Thus, we can estimate the parameters of the distribution by solving for that 
set of parameters that equates the theoretical moments of the distribution 
with the empirical moments. 
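
The die example can be checked numerically. The sketch below is only an illustration, not part of the original course code: the list `rolls` is a transcription of Table 7.4 (assumed here, with the entries chosen so that the sample moments match Equation 7.19), and the helper names `sample_moments` and `solve_linear_system` are hypothetical. Exact fractions are used so that the solution of the six-equation system can be compared with the sample frequencies without rounding error.

```python
from fractions import Fraction

# Die rolls transcribed from Table 7.4, observations 1 through 50 (assumed transcription).
rolls = [1, 5, 3, 2, 6, 5, 4, 2, 1, 3, 1, 5, 6, 1, 6, 5, 1,
         3, 5, 5, 3, 2, 1, 5, 1, 2, 1, 6, 2, 1, 6, 2, 2, 2,
         6, 4, 6, 2, 1, 3, 3, 5, 4, 2, 1, 1, 3, 3, 3, 3]


def sample_moments(data, max_order):
    """Return the sample moments (1/n) * sum(x**k) for k = 0..max_order."""
    n = len(data)
    return [Fraction(sum(x ** k for x in data), n) for k in range(max_order + 1)]


def solve_linear_system(a, b):
    """Solve a * x = b by Gauss-Jordan elimination in exact arithmetic."""
    n = len(b)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


moments = sample_moments(rolls, 5)
# Row k of the coefficient matrix holds j**k for faces j = 1..6 (Equation 7.20).
vandermonde = [[j ** k for j in range(1, 7)] for k in range(6)]
theta_hat = solve_linear_system(vandermonde, moments)

print([float(m) for m in moments])
print([float(t) for t in theta_hat])
```

Because the coefficient matrix is invertible, matching the first six moments reproduces the observed relative frequency of each face, which is why the method of moments estimates here coincide with simple counting.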

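For another example of a method of moments estimator, consider the
gamma distribution. The theoretical moments for this distribution are

$$M^{(1)} = \alpha\beta, \qquad M^{(2)} = \alpha\beta^2 \tag{7.21}$$

where $M^{(2)}$ is the second central moment. Using the data from Table 7.1, we have

$$\left.\begin{aligned} \alpha\beta &= 2.0331 \\ \alpha\beta^2 &= 1.2625 \end{aligned}\right\} \Rightarrow \hat{\beta}_{MOM} = \frac{1.2625}{2.0331} = 0.6210. \tag{7.22}$$

Next, returning to the theoretical first moment,

$$\hat{\alpha}_{MOM} = \frac{2.0331}{0.6210} \approx 3.274. \tag{7.23}$$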


Each of these estimators relies on sample information in the guise of the 
sample moments. Further, the traditional estimator of the mean and vari- 
ance of the normal distribution can be justified using a method of moments 
estimator. 
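
The two-step solution for the gamma distribution is short enough to sketch in code. This is a minimal illustration under the moment conditions of Equation 7.21; the function name `gamma_method_of_moments` is hypothetical, and the inputs are the sample mean and variance quoted in the text.

```python
def gamma_method_of_moments(mean, variance):
    """Return (alpha, beta) solving alpha*beta = mean and alpha*beta**2 = variance."""
    beta = variance / mean   # dividing the second condition by the first isolates beta
    alpha = mean / beta      # the first moment then gives alpha
    return alpha, beta


alpha_hat, beta_hat = gamma_method_of_moments(2.0331, 1.2625)
print(round(alpha_hat, 4), round(beta_hat, 4))
```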


7.2.2. Nonparametric Estimation 


At the most general level, we can divide estimation procedures into
distribution-specific estimators and nonparametric or distribution-free
estimators. Intuitively, distribution-specific methods have two sources of
information: information from the sample and information based on distributional
assumptions. For example, we can assume that the underlying random variable
obeys a normal distribution. Given this assumption, the characteristics
of the distribution are based on two parameters, the mean and the variance.
Hence, the estimators focus on the first two moments of the sample. The
estimation procedure can be tailored to the estimation of specific distribution
parameters.

Extending our intuitive discussion, if the distributional assumption is
correct, tailoring the estimation to parameters of the distribution improves our
ability to describe the random variable. However, if our assumption about
the distributional form is incorrect, the distribution-specific characteristics
of the estimator could add noise or confuse our ability to describe the
distribution. For example, suppose that we hypothesized the random variable
was normally distributed but the true underlying distribution was negative
exponential

$$f(x|\lambda) = \lambda e^{-\lambda x}, \quad x \ge 0, \text{ and } 0 \text{ otherwise}. \tag{7.24}$$

The negative exponential distribution has a theoretical mean of $1/\lambda$ and a
variance of $1/\lambda^2$. Hence, the negative exponential distribution provides more
restrictions than the normal distribution: a single parameter determines both
moments, so the variance must equal the square of the mean.

Nonparametric or distribution free methods are estimators that are not 
based on specific distributional assumptions. These estimators are less efficient 
in that they cannot take advantage of assumptions such as the relationship 
between the moments of the distribution. However, they are not fragile to 
distributional assumptions (i.e., assuming that the distribution is a normal 
when it is in fact a gamma distribution could significantly affect the estimated 
parameters). 


7.3 Properties of Estimators 


In general, any parameter such as a population mean or variance (i.e., the
$\mu$ and $\sigma^2$ parameters of the normal distribution) may have several different
estimators. For example, we could estimate a simple linear model

$$y_i = \alpha_0 + \alpha_1 x_{1i} + \alpha_2 x_{2i} + \nu_i \tag{7.25}$$

where $y_i$, $x_{1i}$, and $x_{2i}$ are observed and $\alpha_0$, $\alpha_1$, and $\alpha_2$ are parameters,
using ordinary least squares, maximum likelihood, or a method of moments
estimator. Many of the estimators are mathematically similar. For example,
if we assume that the error in Equation 7.25 is normally distributed, the
least squares estimator is also the maximum likelihood estimator. In the cases
where the estimators are different, we need to develop criteria for comparing
the goodness of each estimator.


7.3.1 Measures of Closeness


As a starting point for our discussion, consider a relatively innocuous
criterion: suppose that we choose the estimator that is closest to the true value
of the parameter. For example, suppose that we want to estimate the probability
of a Bernoulli event being 1 (i.e., the probability that the coin toss results in
a head). The general form of the distribution becomes

$$f(Z|\theta) = \theta^{Z}(1-\theta)^{(1-Z)}. \tag{7.26}$$

Next, assume that we develop an estimator

$$X = \frac{1}{N}\sum_{i=1}^{N} z_i \tag{7.27}$$

where the $z_i$ are observed outcomes, with $z_i = 1$ denoting a head and $z_i = 0$
denoting a tail. Next, suppose that we had a different estimator

$$Y = \sum_{i=1}^{N} w_i z_i \tag{7.28}$$

where $w_i$ is a weighting function different from $1/N$. One question is whether
$X$ produces an estimate closer to the true $\theta$ than $Y$.
Unfortunately, there are several different possible measures of closeness:

1. $P(|X - \theta| \le |Y - \theta|) = 1$.

2. $E[g(X - \theta)] \le E[g(Y - \theta)]$ for every continuous function $g(\cdot)$ which is
nonincreasing for $x < 0$ and nondecreasing for $x > 0$.

3. $E[g(|X - \theta|)] \le E[g(|Y - \theta|)]$ for every continuous and nondecreasing
function $g(\cdot)$.

4. $P(|X - \theta| > \epsilon) \le P(|Y - \theta| > \epsilon)$ for every $\epsilon$.

5. $E(X - \theta)^2 \le E(Y - \theta)^2$.

6. $P(|X - \theta| < |Y - \theta|) \ge P(|X - \theta| > |Y - \theta|)$.


Of these possibilities, the most widely used in econometrics are minimum mean
squared error (5) and a likelihood comparison (akin to 2).


7.3.2. Mean Squared Error 


To develop the mean squared error comparison, we will develop an example
presented by Amemiya [1, p. 123]. Following our Bernoulli example, suppose
that we have a sample of two outcomes and want to estimate $\theta$ using one of
three estimators,

$$T = \frac{z_1 + z_2}{2}, \qquad S = z_1, \qquad W = \frac{1}{2}. \tag{7.29}$$

Note that the first estimator corresponds with the estimator presented in
Equation 7.27, while the second estimator corresponds with the estimator in
Equation 7.28 with $w_1 = 1$ and $w_2 = 0$. The third estimator appears ridiculous:
no matter what the outcome, I think that $\theta = 1/2$.

As a starting point, consider constructing a general form of the mean
squared error for each estimator. Notice that the probability of a single event
$z_1$ in the Bernoulli formulation becomes

$$f(z_1|\theta) = \theta^{z_1}(1-\theta)^{(1-z_1)} \Rightarrow \begin{cases} f(z_1 = 1|\theta) = \theta \\ f(z_1 = 0|\theta) = (1-\theta). \end{cases} \tag{7.30}$$

Given the probability function presented in Equation 7.30, we can express the
expected value of estimator $S$ as

$$E[S(z_1)] = (z_1 = 1) \times \theta + (z_1 = 0) \times (1-\theta) = \theta. \tag{7.31}$$

Thus, even though the estimator always estimates either a zero ($z_1 = 0$) or
a one ($z_1 = 1$), on average it is correct. However, the estimate may not be
very close to the true value using the mean squared error measure. The mean
squared error of the estimate for this estimator can then be written as

$$\begin{aligned}
MSE_S(\theta) &= \sum_{z_1} f(z_1|\theta)\,(S(z_1) - \theta)^2 \\
&= f(z_1 = 0|\theta)\,(0-\theta)^2 + f(z_1 = 1|\theta)\,(1-\theta)^2 \\
&= (1-\theta) \times \theta^2 + \theta \times (1-\theta)^2.
\end{aligned} \tag{7.32}$$


Next, consider the same logic for estimator $T$. In the case of $T$, there are
three outcomes: $z_1 + z_2 = 0$, $z_1 + z_2 = 1$, and $z_1 + z_2 = 2$. In the case of
$z_1 + z_2 = 1$, either $z_1 = 1$ and $z_2 = 0$ or $z_1 = 0$ and $z_2 = 1$. In other
words, there are two ways to generate this event. Following our approach in
Equation 7.30, we write the distribution function as

$$f(z_1, z_2|\theta) = \theta^{(z_1+z_2)}(1-\theta)^{(2-z_1-z_2)} \Rightarrow \begin{cases}
f(z_1 = 1, z_2 = 1|\theta) = \theta^2 \\
f(z_1 = 1, z_2 = 0|\theta) = \theta(1-\theta) \\
f(z_1 = 0, z_2 = 1|\theta) = \theta(1-\theta) \\
f(z_1 = 0, z_2 = 0|\theta) = (1-\theta)^2.
\end{cases} \tag{7.33}$$

The expected value of the estimator $T$ can then be derived as

$$\begin{aligned}
E[T(z_1, z_2)] &= \underbrace{(1-\theta)^2 \times 0}_{z_1+z_2=0} + \underbrace{2\theta(1-\theta) \times \frac{1}{2}}_{z_1+z_2=1} + \underbrace{\theta^2 \times 1}_{z_1+z_2=2} \\
&= (\theta - \theta^2) + \theta^2 = \theta.
\end{aligned} \tag{7.34}$$


Again, the expected value of the estimator is correct, but the estimator only
yields three possible values: $T(z_1, z_2) = 0$, $T(z_1, z_2) = 0.5$, or $T(z_1, z_2) = 1$.
We derive the mean squared error as a measure of closeness.

$$\begin{aligned}
MSE_T(\theta) &= \sum_{z_1+z_2} f(z_1 + z_2|\theta)\,(T(z_1, z_2) - \theta)^2 \\
&= (1-\theta)^2(0-\theta)^2 + 2\theta(1-\theta)\left(\frac{1}{2}-\theta\right)^2 + \theta^2(1-\theta)^2 \\
&= 2\left((1-\theta)^2\theta^2 + \theta(1-\theta)\left(\frac{1}{2}-\theta\right)^2\right).
\end{aligned} \tag{7.35}$$

Finally, for completeness we define the mean squared error of the $W$
estimator as

$$MSE_W(\theta) = (1-\theta)\left(\frac{1}{2}-\theta\right)^2 + \theta\left(\frac{1}{2}-\theta\right)^2. \tag{7.36}$$

The mean squared error for each estimator is presented in Figure 7.7. The
question (loosely phrased) is then which is the best estimator of $\theta$? In
answering this question, however, two kinds of ambiguities occur. For a particular
value of the parameter, say $\theta = 3/4$, it is not clear which of the three
estimators is preferred. $T$ dominates $W$ for $\theta = 0$, but $W$ dominates $T$ for
$\theta = 1/2$.


Definition 7.1. Let $X$ and $Y$ be two estimators of $\theta$. We say that $X$ is better
(or more efficient) than $Y$ if $E(X - \theta)^2 \le E(Y - \theta)^2$ for all $\theta \in \Theta$ and
strictly less than for at least one $\theta \in \Theta$.


When an estimator is dominated by another estimator, the dominated
estimator is inadmissible.


Definition 7.2. Let $\hat{\theta}$ be an estimator of $\theta$. We say that $\hat{\theta}$ is inadmissible if
there is another estimator which is better in the sense that it produces a lower
mean squared error of the estimate. An estimator that is not inadmissible is
admissible.


Thus, at least one of these estimators is inadmissible: in fact $T$ always
performs at least as well as $S$ (and strictly better for $0 < \theta < 1$), so $S$ is
dominated and inadmissible. However, this criterion does not allow us to rank
$S$ and $W$, or $T$ and $W$.

FIGURE 7.7
Comparison of MSE for Various Estimators.
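
The comparison among the three estimators can be reproduced by brute-force enumeration. The following sketch is illustrative only (the function names are hypothetical): it sums the squared error over the four possible outcomes of two Bernoulli draws, weighted by the probabilities in Equation 7.33, using exact fractions so that the results can be compared with closed forms.

```python
from fractions import Fraction
from itertools import product


def estimator_t(z1, z2):
    return Fraction(z1 + z2, 2)   # sample mean of the two draws


def estimator_s(z1, z2):
    return Fraction(z1)           # uses only the first draw


def estimator_w(z1, z2):
    return Fraction(1, 2)         # ignores the data entirely


def mse(estimator, theta):
    """Exact MSE: sum over outcomes of f(z1, z2 | theta) * (estimate - theta)^2."""
    total = Fraction(0)
    for z1, z2 in product((0, 1), repeat=2):
        prob = theta ** (z1 + z2) * (1 - theta) ** (2 - z1 - z2)
        total += prob * (estimator(z1, z2) - theta) ** 2
    return total


grid = [Fraction(k, 20) for k in range(21)]
for theta in grid:
    print(float(theta),
          float(mse(estimator_s, theta)),
          float(mse(estimator_t, theta)),
          float(mse(estimator_w, theta)))
```

On this grid the enumeration agrees with the simplified forms $\theta(1-\theta)$ for $S$, $\theta(1-\theta)/2$ for $T$, and $(1/2-\theta)^2$ for $W$, which makes the dominance of $T$ over $S$ and the crossing of $T$ and $W$ easy to see.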


7.3.3 Strategies for Choosing an Estimator 


Subjective strategy: This strategy considers the likely outcome of $\theta$ and selects
the estimator that is best in that likely neighborhood.

Minimax strategy: According to the minimax strategy, we choose the estimator
for which the largest possible value of the mean squared error is the smallest.


Definition 7.3. Let $\hat{\theta}$ be an estimator of $\theta$. It is a minimax estimator if, for
any other estimator $\tilde{\theta}$ of $\theta$, we have

$$\max_{\theta} E\left[(\hat{\theta} - \theta)^2\right] \le \max_{\theta} E\left[(\tilde{\theta} - \theta)^2\right]. \tag{7.37}$$


Returning to our previous example, $T$ is chosen over $W$ according to the
minimax strategy because the maximum MSE for $T$ is 0.125 (attained at
$\theta = 1/2$) while the maximum MSE for $W$ is 0.25 (attained at $\theta = 0$ and $\theta = 1$).
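
A grid search makes the minimax comparison concrete. This sketch assumes the simplified forms $MSE_T(\theta) = \theta(1-\theta)/2$ and $MSE_W(\theta) = (1/2-\theta)^2$, which follow from collecting terms in Equations 7.35 and 7.36; the function names and the grid resolution are arbitrary choices for illustration.

```python
def mse_t(theta):
    """Simplified form of Equation 7.35."""
    return theta * (1.0 - theta) / 2.0


def mse_w(theta):
    """Simplified form of Equation 7.36."""
    return (0.5 - theta) ** 2


grid = [k / 1000 for k in range(1001)]      # theta from 0 to 1
max_t = max(mse_t(theta) for theta in grid)
max_w = max(mse_w(theta) for theta in grid)

# The minimax rule picks the estimator whose worst-case MSE is smallest.
minimax_choice = "T" if max_t < max_w else "W"
print(max_t, max_w, minimax_choice)
```

Under these closed forms the worst case for $T$ occurs at $\theta = 1/2$ and the worst case for $W$ at the endpoints, so the rule selects $T$.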


7.3.4 Best Linear Unbiased Estimator 


To begin our development of the best linear unbiased estimator, we need to 
develop the concept of an unbiased estimator in Definition 7.4. 


Definition 7.4. $\hat{\theta}$ is said to be an unbiased estimator of $\theta$ if $E[\hat{\theta}] = \theta$ for all
$\theta \in \Theta$. We call $E[\hat{\theta}] - \theta$ the bias.

In our previous discussion $T$ and $S$ are unbiased estimators while $W$ is biased.


Theorem 7.5. The mean squared error is the sum of the variance and the
bias squared. That is, for any estimator $\hat{\theta}$ of $\theta$,

$$E(\hat{\theta} - \theta)^2 = V(\hat{\theta}) + \left(E(\hat{\theta}) - \theta\right)^2. \tag{7.38}$$
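
The result can be verified in a few lines by adding and subtracting $E(\hat{\theta})$ inside the square; the steps below are a standard sketch rather than a derivation taken from the text. The cross term vanishes because $E(\hat{\theta}) - \theta$ is a constant and $E[\hat{\theta} - E(\hat{\theta})] = 0$.

$$\begin{aligned}
E(\hat{\theta} - \theta)^2 &= E\left[\left(\hat{\theta} - E(\hat{\theta})\right) + \left(E(\hat{\theta}) - \theta\right)\right]^2 \\
&= E\left(\hat{\theta} - E(\hat{\theta})\right)^2 + 2\left(E(\hat{\theta}) - \theta\right)E\left[\hat{\theta} - E(\hat{\theta})\right] + \left(E(\hat{\theta}) - \theta\right)^2 \\
&= V(\hat{\theta}) + \left(E(\hat{\theta}) - \theta\right)^2.
\end{aligned}$$
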
Next, consider an unbiased estimator of the mean

$$\hat{\mu} = \sum_{i=1}^{N} a_i x_i. \tag{7.39}$$

Comparing the $T$ and $S$ estimators, for $T$, $a_i = 1/2 = 1/N$ while for $S$, $a_1 = 1$
and $a_2 = 0$. The conjecture from our example was that $T$ was better than $S$.
It produced a lower MSE or a lower variance of the estimate. To formalize
this conjecture, consider Theorem 7.6.


Theorem 7.6. Let $\{X_i\}$, $i = 1, 2, \cdots, N$ be independent and have a common
mean $\mu$ and variance $\sigma^2$. Consider the class of linear estimators of $\mu$ which
can be written in the form

$$\hat{\mu} = \sum_{i=1}^{N} a_i X_i \tag{7.40}$$

and impose the unbiasedness condition

$$E\left[\sum_{i=1}^{N} a_i X_i\right] = \mu. \tag{7.41}$$

Then

$$V(\bar{X}) \le V\left(\sum_{i=1}^{N} a_i X_i\right) \tag{7.42}$$

for all $a_i$ satisfying the unbiasedness condition. Further, this condition holds
with equality only for $a_i = 1/N$.


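To prove these points, note that the $a_i$s must sum to one for unbiasedness.

$$E\left[\sum_{i=1}^{N} a_i X_i\right] = \sum_{i=1}^{N} a_i E[X_i] = \sum_{i=1}^{N} a_i \mu = \mu \sum_{i=1}^{N} a_i. \tag{7.43}$$

Therefore, $\sum_{i=1}^{N} a_i = 1$ results in an unbiased estimator. The final condition
can be demonstrated through the identity (which uses $\sum_{i=1}^{N} a_i = 1$)

$$\sum_{i=1}^{N} \left(a_i - \frac{1}{N}\right)^2 = \sum_{i=1}^{N} a_i^2 - \frac{1}{N}. \tag{7.44}$$

If $a_i = 1/N$ for all $i$ then $\sum_{i=1}^{N} (a_i - 1/N)^2 = 0$. Thus, any other $a_i$ must
yield a higher variance ($\sum_{i=1}^{N} (a_i - 1/N)^2 \ge 0$):

$$\sum_{i=1}^{N} \left(a_i - \frac{1}{N}\right)^2 \sigma^2 \ge 0 \Rightarrow \sum_{i=1}^{N} a_i^2 \sigma^2 \ge \left(\frac{1}{N}\right)\sigma^2. \tag{7.45}$$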


Theorem 7.7. Consider the problem of minimizing

$$\sum_{i=1}^{N} a_i^2 \tag{7.46}$$

with respect to $\{a_i\}$ subject to the condition

$$\sum_{i=1}^{N} a_i b_i = 1. \tag{7.47}$$

The solution to this problem is given by

$$a_i = \frac{b_i}{\sum_{j=1}^{N} b_j^2}. \tag{7.48}$$

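Proof. Consider the Lagrange formulation for this minimization problem

$$L = \sum_{i=1}^{N} a_i^2 + \lambda\left(1 - \sum_{i=1}^{N} a_i b_i\right) \tag{7.49}$$

yielding the general first order condition

$$\frac{\partial L}{\partial a_i} = 2a_i - \lambda b_i = 0 \Rightarrow a_i = \frac{\lambda}{2} b_i. \tag{7.50}$$

Substituting this result back into the constraint

$$1 - \frac{\lambda}{2}\sum_{i=1}^{N} b_i^2 = 0 \Rightarrow \lambda = \frac{2}{\sum_{i=1}^{N} b_i^2}. \tag{7.51}$$

Substituting Equation 7.51 into Equation 7.50 yields Equation 7.48. Holding

$$\lambda = \frac{2a_i}{b_i} = \frac{2a_j}{b_j} \quad \forall\, i, j \tag{7.52}$$

implying that $b_i = b_j = 1 \Rightarrow a_i = 1/N$.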
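
A small numerical check of Theorems 7.6 and 7.7 may help. The sketch below is illustrative: `min_norm_weights` implements the solution in Equation 7.48, `variance_factor` returns $\sum a_i^2$ (the variance of the weighted sum divided by $\sigma^2$), and the unequal weights are hypothetical values chosen only because they also sum to one.

```python
def min_norm_weights(b):
    """Theorem 7.7 solution: a_i = b_i / sum_j b_j**2."""
    denom = sum(bj * bj for bj in b)
    return [bi / denom for bi in b]


def variance_factor(a):
    """V(sum a_i X_i) / sigma^2 for independent X_i with a common variance."""
    return sum(ai * ai for ai in a)


n = 4
equal = min_norm_weights([1.0] * n)   # b_i = 1 for all i gives a_i = 1/N
unequal = [0.4, 0.3, 0.2, 0.1]        # hypothetical unbiased weights (they sum to one)

print(equal, variance_factor(equal), variance_factor(unequal))
```

The gap between the two variance factors equals $\sum (a_i - 1/N)^2$, which is the identity in Equation 7.44.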


Thus, equally weighting the observations yields the minimum variance
estimator.


7.3.5 Asymptotic Properties


Unbiasedness works well for certain classes of estimators such as the mean.
Other estimators are somewhat more complicated. For example, the maximum
likelihood estimator of the variance can be written as

$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2. \tag{7.53}$$

However, this estimator is biased. As we will develop in our discussion of the
$\chi^2$ distribution, the unbiased estimator of the variance is

$$\tilde{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2. \tag{7.54}$$

Notice that as the sample size becomes large the maximum likelihood
estimator of the variance converges to the unbiased estimator of the variance.
Rephrasing the discussion slightly, the maximum likelihood estimator of the
variance is a consistent estimator of the underlying variance, as described in
Definition 7.8.


Definition 7.8. We say that $\hat{\theta}$ is a consistent estimator of $\theta$ if

$$\operatorname{plim}_{n \to \infty} \hat{\theta} = \theta. \tag{7.55}$$

Certain estimators such as Bayesian estimators are biased, but consistent. As
the sample size increases, the estimator will converge to the true value of the
parameter. In the case of the Bayesian estimator the bias introduced by the
prior becomes small as the sample size expands.
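
The difference between the two variance estimators is only the divisor, which the following sketch makes explicit. The function names are hypothetical and the five-point sample is used purely for illustration; the standard library's `statistics.pvariance` and `statistics.variance` compute the same two quantities.

```python
import statistics


def variance_mle(data):
    """Sum of squared deviations divided by N (the maximum likelihood form)."""
    mean = statistics.fmean(data)
    return sum((x - mean) ** 2 for x in data) / len(data)


def variance_unbiased(data):
    """Sum of squared deviations divided by N - 1 (the unbiased form)."""
    mean = statistics.fmean(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - 1)


sample = [5, 6, 7, 8, 10]
print(variance_mle(sample), variance_unbiased(sample))

# The ratio of the two estimators is (N - 1) / N, which tends to one as N grows.
for n in (5, 50, 500, 5000):
    print(n, (n - 1) / n)
```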


7.3.6 Maximum Likelihood 


The basic concept behind maximum likelihood estimation is to choose that set 
of parameters that maximizes the likelihood of drawing a particular sample. 
For example, suppose that we know that a sample of random variables has a 
variance of one, but an unknown mean. Let the sample be X = {5,6,7,8, 10}. 
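The probability of each of these points based on the unknown mean ($\mu$) can
be written as

$$\begin{aligned}
f(5|\mu) &= \frac{1}{\sqrt{2\pi}}\exp\left[-\frac{(5-\mu)^2}{2}\right] \\
f(6|\mu) &= \frac{1}{\sqrt{2\pi}}\exp\left[-\frac{(6-\mu)^2}{2}\right] \\
&\;\;\vdots \\
f(10|\mu) &= \frac{1}{\sqrt{2\pi}}\exp\left[-\frac{(10-\mu)^2}{2}\right].
\end{aligned} \tag{7.56}$$

Assuming that the sample is independent so that the joint distribution
function can be written as the product of the marginal distribution functions, the
probability of drawing the entire sample based on a given mean can then be
written as

$$L(X|\mu) = \frac{1}{(2\pi)^{5/2}}\exp\left[-\frac{(5-\mu)^2}{2} - \frac{(6-\mu)^2}{2} - \cdots - \frac{(10-\mu)^2}{2}\right]. \tag{7.57}$$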

The function $L(X|\mu)$ is typically referred to as the likelihood function.
The value of $\mu$ that maximizes the likelihood function of the sample can then
be defined by

$$\max_{\mu} L(X|\mu). \tag{7.58}$$

Under the current scenario, we find it easier, however, to maximize the natural
logarithm of the likelihood function

$$\begin{aligned}
\max_{\mu} \ln(L(X|\mu)) &= K - \frac{(5-\mu)^2}{2} - \frac{(6-\mu)^2}{2} - \cdots - \frac{(10-\mu)^2}{2} \\
\Rightarrow \frac{\partial \ln(L(X|\mu))}{\partial \mu} &= (5-\mu) + (6-\mu) + \cdots + (10-\mu) = 0 \\
\hat{\mu}_{MLE} &= \frac{5+6+7+8+10}{5} = 7.2
\end{aligned} \tag{7.59}$$

where $K = -\frac{5}{2}\ln(2\pi)$. Note that the constant does not affect the estimate
(i.e., the derivative $\partial K/\partial \mu = 0$).
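
The maximization can be checked numerically. The sketch below is an illustration under the assumptions of this section (normal density, variance fixed at one) and uses the five-point sample as stated in the text; the grid search is deliberately crude and is there only to confirm the closed-form answer from the first-order condition.

```python
import math

sample = [5, 6, 7, 8, 10]   # the sample as stated in the text; variance fixed at one


def log_likelihood(mu, data):
    """Log of the product of N(mu, 1) densities evaluated at the data."""
    n = len(data)
    return -0.5 * n * math.log(2.0 * math.pi) - 0.5 * sum((x - mu) ** 2 for x in data)


# Closed form from the first-order condition: the sample mean.
mu_closed_form = sum(sample) / len(sample)

# Crude grid search over [5, 10] in steps of 0.01 as a check.
grid = [5.0 + 0.01 * k for k in range(501)]
mu_grid = max(grid, key=lambda mu: log_likelihood(mu, sample))

print(mu_closed_form, mu_grid)
```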


7.4 Sufficient Statistics 


There are a number of ways to classify statistical distributions. One of the most
popular involves the number of parameters used to specify the distribution.
For example, consider the set of distribution functions with two parameters
such as the normal distribution, the gamma distribution, and the beta
distribution. Each of these distributions is completely specified by two parameters.
Intuitively, all their moments are functions of the two identifying parameters.
The sufficient statistic is the empirical counterpart of this concept. Specifically,
two empirical moments of the distribution contain all the relevant information
regarding the distribution. Put slightly differently, two empirical moments (or
functions of those moments) are sufficient to describe the distribution.


7.4.1 Data Reduction 


The typical mode of operation in statistics is to use information from a sample
$X_1, \cdots, X_N$ to make inferences about an unknown parameter $\theta$. The researcher
summarizes the information in the sample (or the sample values) with a
statistic. Thus, any statistic $T(X)$ summarizes the data, or reduces the information
in the sample to a single number. We use only the information in the statistic
instead of the entire sample. Put in a slightly more mathematical formulation,
the statistic partitions the sample space into two sets defining the sample space
for the statistic

$$\mathcal{T} = \{t : t = T(x),\; x \in \mathcal{X}\}. \tag{7.60}$$

Thus, a given value of a sample statistic $T(x)$ implies that the sample
comes from a space of sets $A_t$ such that $t \in \mathcal{T}$, $A_t = \{x : T(x) = t\}$. The
second possibility (that is ruled out by observing a sample statistic of $T(x)$)
is $A_t^C = \{x : T(x) \ne t\}$. Thus, instead of presenting the entire sample, we
could report the value of the sample statistic.


7.4.2 Sufficiency Principle 


Intuitively, a sufficient statistic for a parameter is a statistic that captures all
the information about a given parameter contained in the sample.

Sufficiency Principle: If $T(X)$ is a sufficient statistic for $\theta$, then any inference
about $\theta$ should depend on the sample $X$ only through the value of $T(X)$. That
is, if $x$ and $y$ are two sample points such that $T(x) = T(y)$, then the inference
about $\theta$ should be the same whether $X = x$ or $X = y$.


Definition 7.9 (Casella and Berger). A statistic $T(X)$ is a sufficient statistic
for $\theta$ if the conditional distribution of the sample $X$ given $T(X)$ does not
depend on $\theta$ [7, p. 272].


Definition 7.10 (Hogg, McKean, and Craig). Let $X_1, X_2, \cdots X_n$ denote a
random sample of size $n$ from a distribution that has a pdf (probability
density function) or pmf (probability mass function) $f(x|\theta)$, $\theta \in \Theta$. Let
$Y_1 = u_1(X_1, X_2, \cdots X_n)$ be a statistic whose pdf or pmf is $f_{Y_1}(y_1|\theta)$. Then
$Y_1$ is a sufficient statistic for $\theta$ if and only if

$$\frac{f(x_1|\theta)\, f(x_2|\theta) \cdots f(x_n|\theta)}{f_{Y_1}\left[u_1(x_1, x_2, \cdots x_n)\,|\,\theta\right]} = H(x_1, x_2, \cdots x_n) \tag{7.61}$$

where $H(x_1, x_2, \cdots x_n)$ does not depend on $\theta \in \Theta$ [18, p. 375].


Theorem 7.11 (Casella and Berger). If $p(x|\theta)$ is the joint pdf (probability
density function) or pmf (probability mass function) of $X$ and $q(t|\theta)$ is the
pdf or pmf of $T(X)$, then $T(X)$ is a sufficient statistic for $\theta$ if, for every $x$
in the sample space, the ratio of

$$\frac{p(x|\theta)}{q(T(x)|\theta)} \tag{7.62}$$

is a constant as a function of $\theta$ [7, p. 274].


Example 7.12. Normal sufficient statistic: Let $X_1, \cdots X_n$ be independently
and identically distributed $N(\mu, \sigma^2)$ where the variance is known. The sample
mean $T(X) = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sufficient statistic for $\mu$. Starting with
the joint distribution function

$$\begin{aligned}
f(x|\mu) &= \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right] \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right].
\end{aligned} \tag{7.63}$$

Next, we add and subtract $\bar{x}$, yielding

$$\begin{aligned}
f(x|\mu) &= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\sum_{i=1}^{n}\frac{(x_i - \bar{x} + \bar{x} - \mu)^2}{2\sigma^2}\right] \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right]
\end{aligned} \tag{7.64}$$

where the last equality derives from

$$\sum_{i=1}^{n} (x_i - \bar{x})(\bar{x} - \mu) = (\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \bar{x}) = 0. \tag{7.65}$$

The distribution of the sample mean is

$$q(T(X)|\theta) = \frac{1}{\left(2\pi\frac{\sigma^2}{n}\right)^{1/2}}\exp\left[-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right]. \tag{7.66}$$

The ratio of the information in the sample to the information in the statistic
becomes

$$\begin{aligned}
\frac{f(x|\theta)}{q(T(x)|\theta)} &= \frac{\dfrac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right]}{\dfrac{1}{\left(2\pi\frac{\sigma^2}{n}\right)^{1/2}}\exp\left[-\dfrac{n(\bar{x}-\mu)^2}{2\sigma^2}\right]} \\
&= \frac{1}{n^{1/2}(2\pi\sigma^2)^{(n-1)/2}}\exp\left[-\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{2\sigma^2}\right]
\end{aligned} \tag{7.67}$$

which does not depend on $\mu$.

Theorem 7.13 (Casella and Berger, Factorization Theorem). Let $f(x|\theta)$
denote the joint pdf (probability density function) or pmf (probability mass
function) of a sample $X$. A statistic $T(X)$ is a sufficient statistic for $\theta$ if and
only if there exist functions $g(t|\theta)$ and $h(x)$ such that, for all sample points
$x$ and all parameter points $\theta$,

$$f(x|\theta) = g(T(x)|\theta)\, h(x) \tag{7.68}$$

[7, p. 276].


Definition 7.14 (Casella and Berger). A sufficient statistic $T(X)$ is called
a minimal sufficient statistic if, for any other sufficient statistic $T'(X)$, $T(X)$
is a function of $T'(X)$ [7, p. 280].


Basically, the minimal sufficient statistics for the normal are the sum of
the sample observations and the sum of the sample observations squared. All
the parameters of the normal can be derived from these two sample moments.
Similarly, the method of moments estimator for the gamma distribution
presented in Section 7.2.1 uses the first two sample moments to estimate the
parameters of the distribution.
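
Returning to the normal example with known variance, sufficiency of the sample mean can be illustrated numerically: two samples with the same mean should have a likelihood ratio that does not change with $\mu$. The sketch below is only an illustration; the two samples are hypothetical, and the variance is assumed to be one.

```python
import math


def normal_log_likelihood(data, mu, sigma2=1.0):
    """Log of the joint normal density for an i.i.d. sample with known variance."""
    n = len(data)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in data) / (2.0 * sigma2))


# Two hypothetical samples with the same mean (7.0) but different spread.
x = [5.0, 7.0, 9.0]
y = [6.0, 7.0, 8.0]

# Difference of log-likelihoods, i.e., the log of the likelihood ratio, at several means.
diffs = []
for mu in (0.0, 3.5, 7.0, 12.0):
    diffs.append(normal_log_likelihood(x, mu) - normal_log_likelihood(y, mu))
print(diffs)
```

The difference depends only on the within-sample sums of squared deviations, so it is the same at every value of $\mu$; all the information about $\mu$ enters through the common sample mean.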


7.5 Concentrated Likelihood Functions 


In our development of the concept of maximum likelihood in Section 7.3.6 we
assumed that we knew the variance of the normal distribution, but the mean
was unknown. Undoubtedly, this framework is fictional. Even if we know that
the distribution is normal, it would be a rare event to know the variance. Next,
consider a scenario where we concentrate the variance out of the likelihood
function. Essentially, we solve for the maximum likelihood estimate of the
variance and substitute that estimate into the original normal specification to
derive estimates of the mean. The more general form of the normal
likelihood function can be written as

$$L(X|\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(X_i-\mu)^2}{2\sigma^2}\right]. \tag{7.69}$$

Ignoring the constants, the natural logarithm of the likelihood function can
be written as

$$\ln(L) = -\frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i-\mu)^2. \tag{7.70}$$

This expression can be solved for the optimal choice of $\sigma^2$ by differentiating
with respect to $\sigma^2$.

$$\begin{aligned}
\frac{\partial \ln(L)}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(X_i-\mu)^2 = 0 \\
&\Rightarrow -n\sigma^2 + \sum_{i=1}^{n}(X_i-\mu)^2 = 0 \\
&\Rightarrow \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2.
\end{aligned} \tag{7.71}$$


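Substituting this result into the original logarithmic likelihood yields

$$\begin{aligned}
\ln(L) &= -\frac{n}{2}\ln\left(\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2\right) - \frac{1}{2\,\frac{1}{n}\sum_{j=1}^{n}(X_j-\mu)^2}\sum_{i=1}^{n}(X_i-\mu)^2 \\
&= -\frac{n}{2}\ln\left(\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2\right) - \frac{n}{2}.
\end{aligned} \tag{7.72}$$

Intuitively, the maximum likelihood estimate of $\mu$ is that value that
minimizes the mean squared deviation of the observations from $\mu$. Thus, the least
squares estimate of the mean of a normal distribution is the same as the maximum
likelihood estimator under the assumption that the sample is independently and
identically distributed.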
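
A short numerical check of the concentrated likelihood may be useful. This is a sketch under the assumptions of this section: the function name is hypothetical, the five-point sample from Section 7.3.6 is reused purely for illustration, and the grid is centered on the sample mean so that the mean itself is one of the candidate values.

```python
import math


def concentrated_log_likelihood(mu, data):
    """Concentrated form: -(n/2) ln[(1/n) sum (X_i - mu)^2] - n/2."""
    n = len(data)
    sigma2_hat = sum((x - mu) ** 2 for x in data) / n   # variance estimate for this mu
    return -0.5 * n * math.log(sigma2_hat) - 0.5 * n


sample = [5, 6, 7, 8, 10]   # reused for illustration
x_bar = sum(sample) / len(sample)

# Candidate means on a grid of width 0.05 centered on the sample mean.
grid = [x_bar + 0.05 * k for k in range(-40, 41)]
best = max(grid, key=lambda mu: concentrated_log_likelihood(mu, sample))

print(x_bar, best)
```

Because the logarithm is increasing, maximizing the concentrated function is the same as minimizing the sum of squared deviations, which is why the grid maximum lands on the sample mean.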


7.6 Normal Equations 


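If we extend the above discussion to a linear regression, we can derive the
normal equations. Specifically, if

$$y_i = \alpha_0 + \alpha_1 x_i + \epsilon_i \tag{7.73}$$

where $\epsilon_i$ is distributed independently and identically normal, the concentrated
likelihood function above can be rewritten as

$$\ln(L) = -\frac{n}{2}\ln\left(\frac{1}{n}\sum_{i=1}^{n}(y_i - \alpha_0 - \alpha_1 x_i)^2\right) - \frac{n}{2}. \tag{7.74}$$

Taking the derivative with respect to $\alpha_0$ yields

$$\begin{aligned}
-\frac{n}{2}\,\frac{1}{\sum_{j=1}^{n}\left[y_j - \alpha_0 - \alpha_1 x_j\right]^2}\sum_{i=1}^{n}\left[-2y_i + 2\alpha_0 + 2\alpha_1 x_i\right] &= 0 \\
\Rightarrow -\frac{1}{n}\sum_{i=1}^{n} y_i + \alpha_0 + \alpha_1\frac{1}{n}\sum_{i=1}^{n} x_i &= 0 \\
\Rightarrow \alpha_0 &= \frac{1}{n}\sum_{i=1}^{n} y_i - \alpha_1\frac{1}{n}\sum_{i=1}^{n} x_i.
\end{aligned} \tag{7.75}$$

Taking the derivative with respect to $\alpha_1$ yields

$$\begin{aligned}
-\frac{n}{2}\,\frac{1}{\sum_{j=1}^{n}\left[y_j - \alpha_0 - \alpha_1 x_j\right]^2}\sum_{i=1}^{n}\left[-2x_i y_i + 2\alpha_0 x_i + 2\alpha_1 x_i^2\right] &= 0 \\
\Rightarrow -\frac{1}{n}\sum_{i=1}^{n} x_i y_i + \alpha_0\frac{1}{n}\sum_{i=1}^{n} x_i + \alpha_1\frac{1}{n}\sum_{i=1}^{n} x_i^2 &= 0.
\end{aligned} \tag{7.76}$$

Substituting for $\alpha_0$ yields

$$-\frac{1}{n}\sum_{i=1}^{n} x_i y_i + \left(\frac{1}{n}\sum_{i=1}^{n} y_i\right)\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) - \alpha_1\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2 + \alpha_1\frac{1}{n}\sum_{i=1}^{n} x_i^2 = 0. \tag{7.77}$$

Hence,

$$\alpha_1 = \frac{\dfrac{1}{n}\sum_{i=1}^{n} x_i y_i - \left(\dfrac{1}{n}\sum_{i=1}^{n} x_i\right)\left(\dfrac{1}{n}\sum_{i=1}^{n} y_i\right)}{\dfrac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\dfrac{1}{n}\sum_{i=1}^{n} x_i\right)^2}. \tag{7.78}$$

Hence, the estimated coefficients for the linear model can be computed
from the normal equations.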
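
The two normal equations translate directly into code. The sketch below is illustrative: the function name `simple_regression` is hypothetical, and the data are made up so that they lie exactly on the line $y = 1 + 2x$, which makes it easy to see whether the formulas recover the intercept and slope.

```python
def simple_regression(x, y):
    """Solve the two normal equations for y_i = a0 + a1 * x_i + e_i."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    mean_xy = sum(xi * yi for xi, yi in zip(x, y)) / n
    mean_xx = sum(xi * xi for xi in x) / n
    # Slope: sample covariance of x and y divided by the sample variance of x.
    a1 = (mean_xy - mean_x * mean_y) / (mean_xx - mean_x ** 2)
    # Intercept: the fitted line passes through the point of means.
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Hypothetical data lying exactly on y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
a0_hat, a1_hat = simple_regression(x, y)
print(a0_hat, a1_hat)
```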


7.7 Properties of Maximum Likelihood Estimators 


To complete our discussion of point estimators, we want to state some of the
relevant properties of the general maximum likelihood estimator. First, the
maximum likelihood provides a convenient estimator of the variance of the
estimated parameters based on the Cramér-Rao Lower Bound.

Theorem 7.15 (Cramér-Rao Lower Bound). Let $L(X_1, X_2, \cdots X_n|\theta)$ be the
likelihood function and let $\hat{\theta}(X_1, X_2, \cdots X_n)$ be an unbiased estimator of $\theta$.
Then, under general conditions, we have

$$V(\hat{\theta}) \ge -\frac{1}{E\left[\dfrac{\partial^2 \ln L}{\partial \theta^2}\right]}. \tag{7.79}$$


Intuitively, following the Lindeberg-Levy theorem, if the maximum likelihood
estimator is consistent then the distribution of the estimates will converge to
normality.


Theorem 7.16 (Asymptotic Normality). Let the likelihood function be
$L(X_1, X_2, \cdots X_n|\theta)$. Then, under general conditions, the maximum likelihood
estimator of $\theta$ is asymptotically distributed as

$$\hat{\theta} \overset{A}{\sim} N\left(\theta,\; -\left[E\,\frac{\partial^2 \ln L}{\partial \theta^2}\right]^{-1}\right). \tag{7.80}$$


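Using the second-order Taylor series expansion of the log of the likelihood
function,

$$\begin{aligned}
\ln(L(\theta)) \approx \ln(L(\theta_0)) &+ \left.\frac{\partial \ln(L(\theta))}{\partial \theta}\right|_{\theta=\theta_0}(\theta - \theta_0) \\
&+ \frac{1}{2}\left.\frac{\partial^2 \ln(L(\theta))}{\partial \theta^2}\right|_{\theta=\theta_0}(\theta - \theta_0)^2.
\end{aligned} \tag{7.81}$$

Letting $\theta_0$ be the estimated value, as the estimated value approaches the true
value (i.e., assume that $\theta_0$ maximizes the log-likelihood function),

$$\left.\frac{\partial \ln(L(\theta))}{\partial \theta}\right|_{\theta=\theta_0} \to 0. \tag{7.82}$$

To meet the maximization conditions

$$\left.\frac{\partial^2 \ln(L(\theta))}{\partial \theta^2}\right|_{\theta=\theta_0} < 0. \tag{7.83}$$

Taking a little freedom with these results and imposing the fact that the
maximum likelihood estimator is consistent,

$$\ln(L(\theta)) - \ln(L(\hat{\theta})) \approx \frac{1}{2}\left.\frac{\partial^2 \ln(L(\theta))}{\partial \theta^2}\right|_{\theta=\hat{\theta}}(\theta - \hat{\theta})^2. \tag{7.84}$$

Taking the expectation of Equation 7.84, and then inserting the results into
the characteristic function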


f(A) =exp | ad (0 6) ae (" = y (7.85) 


yields a result consistent with the Cramer—Rao lower bound. 
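
For the normal mean with known variance the bound can be evaluated directly. The sketch below is an illustration, not the text's derivation: it approximates the second derivative of the log-likelihood by a central finite difference (the step size is an arbitrary choice) and inverts its negative, reusing the five-point sample from Section 7.3.6 with the variance assumed to be one.

```python
def log_likelihood(mu, data):
    """N(mu, 1) log-likelihood up to an additive constant."""
    return -0.5 * sum((x - mu) ** 2 for x in data)


def second_derivative(f, point, step=1e-3):
    """Central finite-difference approximation to f''(point)."""
    return (f(point + step) - 2.0 * f(point) + f(point - step)) / step ** 2


sample = [5, 6, 7, 8, 10]
mu_hat = sum(sample) / len(sample)

hessian = second_derivative(lambda mu: log_likelihood(mu, sample), mu_hat)
variance_bound = -1.0 / hessian   # negative inverse of the second derivative

print(hessian, variance_bound)
```

With unit variance and five observations the second derivative is $-5$, so the bound is $1/5$, which is the familiar $\sigma^2/n$ variance of the sample mean.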


7.8 Chapter Summary 


• A basic concept is that randomly drawing from a population allows the
researcher to replicate the distribution of the full population in the sample.


• Many familiar estimators are point estimators. These estimators estimate
the value of a specific parameter or statistic. Chapter 8 extends our
discussion to interval estimators which allow the economist to estimate a
range of parameter or statistic values.


• There are a variety of measures of the quality of an estimator. In this
chapter we are primarily interested in measures of closeness (i.e., how
close the estimator is to the true population value).


- One measure of closeness is the mean squared error of the estimator.


- An estimator is inadmissible if another estimator yields a smaller or
equal mean squared error for all possible parameter values and a
strictly smaller mean squared error for at least one value.


- There may be more than one admissible estimator.


- Measures of closeness allow for a variety of strategies for choosing
among estimators including the minimax strategy, which minimizes the
maximum mean squared error.


• In econometrics we are often interested in the Best Linear Unbiased
Estimator (BLUE).


• An estimator is unbiased if the expected value of the estimator is equal to
the true value of the parameter. Alternatively, estimators may be consistent,
implying that the value of the estimator converges to the true value as the
sample size grows.


• Sufficient statistics are the collection of sample statistics that contain all
the information in the sample regarding a particular parameter, in the sense
that the conditional distribution of the sample given those statistics does
not depend on the parameter. For example, the first and second sample
moments are sufficient statistics for the normal distribution.


7.9 Review Questions

7-1R. Describe the relationship between $P(|T - \theta| > \epsilon) \le P(|S - \theta| > \epsilon)$ and
$E(T - \theta)^2 \le E(S - \theta)^2$ using the convergence results in Chapter 6.


7-2R. A fellow student states that all unbiased estimators are consistent, but 
not all consistent estimators are unbiased. Is this statement true or 
false? Why? 


7.10 Numerical Exercises 


7-1E. Using the distribution function

$$f(x) = \frac{3}{4}\left(1 - (1-x)^2\right), \quad x \in (0, 2) \tag{7.86}$$

generate a sample of 20 random variables.


a. Derive the cumulative distribution function. 
b. Derive the inverse function of the cumulative distribution function. 


c. Draw 20 $U[0,1]$ draws and map those draws back into $x$ space using
the inverse cumulative distribution function.


d. Derive the mean and variance of your new sample. Compare those 
values with the theoretical value of the distribution. 


7-2E. Extend the estimation of the Bernoulli coefficient ($\theta$) in Section 7.3.2 to
three observations. Compare the MSE for two and three sample points
graphically.


7-3E. Using a negative exponential distribution

$$f(x|\lambda) = \lambda\exp(-\lambda x) \tag{7.87}$$

compute the maximum likelihood estimator of $\lambda$ using each column in
Table 7.5.


7-4E. Compute the variance for the estimator of $\lambda$ in Exercise 7-3E.

7-5E. Compute the normal equations for the regression

$$y_t = \alpha_0 + \alpha_1 x_t + \epsilon_t \tag{7.88}$$

where $y_t$ is the interest rate on agricultural loans to Florida farmers
and $x_t$ is the interest rate on Baa Corporate bonds in Appendix D.


182 Mathematical Statistics for Applied Econometrics 


TABLE 7.5
Exercise 7-3E Data

Obs.        1         2         3
  1      10.118     1.579     0.005
  2       3.859     0.332     0.283
  3       1.291     0.129     0.523
  4       0.238     0.525     0.093
  5       3.854     0.225     0.177
  6       0.040     2.855     0.329
  7       0.236     0.308     0.560
  8       1.555     2.226     0.094
  9       5.013     0.665     0.084
 10       1.205     1.919     0.041
 11       0.984     0.088     0.604
 12       2.686     0.058     1.167
 13       7.477     0.097     0.413
 14      14.879     0.644     0.077
 15       1.290     0.203     0.218
 16      83.907     2.618     0.514
 17       2.246     0.059     0.3825
 18       5.173     0.052     0.270
 19       2.052     1.871     0.134
 20       8.649     0.783     0.072
 21       6.544     0.603     0.186
 22       6.297     0.189     0.099
 23       4.640     0.260     0.389
 24      10.924     0.677     0.088
 25      10.377     2.259     0.187


As we discussed when we talked about continuous distribution functions, the
probability of a specific number under a continuous distribution is zero. Thus,
if we conceptualize any estimator, either a nonparametric estimate of the
mean or a parametric estimate of a function, the probability that the true
value is equal to the estimated value is zero. Thus, we usually talk
about estimated values in terms of confidence intervals. As in the case when
we discussed the probability of a continuous variable, we define some range of
outcomes. However, this time we usually work the other way around, defining a
certain confidence level and then stating the range of values that attains this
confidence level.


8.1 Confidence Intervals 


Amemiya [1, p. 160] notes a difference between confidence and probability. 
Most troubling is our classic definition of probability as “a probabilistic state- 
ment involving parameters.” This is troublesome due to our inability, without 
some additional Bayesian structure, to state anything concrete about proba- 
bilities. 


Example 8.1. Let $X_i$ be distributed as a Bernoulli distribution, $i = 1, 2, \cdots, N$.
Then

$$T = \bar{X} \overset{A}{\sim} N\left(\theta, \frac{\theta(1-\theta)}{N}\right). \tag{8.1}$$

TABLE 8.1
Confidence Levels

   k        γ/2        γ
1.0000    0.1587    0.3173
1.5000    0.0668    0.1336
1.6449    0.0500    0.1000
1.7500    0.0401    0.0801
1.9600    0.0250    0.0500
2.0000    0.0228    0.0455
2.3263    0.0100    0.0200

Breaking this down a little more, we will construct the estimate of the
Bernoulli parameter as

$$T = \bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i \tag{8.2}$$

where $T = \hat{\theta}$. If the $X_i$ are independent, then

$$V(T) = \frac{1}{N}V(X_i) = \frac{1}{N}\theta(1-\theta). \tag{8.3}$$

Therefore, we can construct a random variable $Z$ that is the standardized
difference between the true value of the parameter $\theta$ and the value of the
observed estimate.

$$Z = \frac{T - \theta}{\sqrt{\dfrac{\theta(1-\theta)}{N}}} \overset{A}{\sim} N(0, 1). \tag{8.4}$$

Why? By the Central Limit Theorem. Given this distribution, we can ask
questions about the probability. Specifically, we know that if $Z$ is distributed
$N(0,1)$, then we can define

$$\gamma_k = P(|Z| < k). \tag{8.5}$$


Essentially, we can either choose a $k$ based on a target probability or we can
define a probability based on our choice of $k$. Using the normal probability, the
one-tailed and two-tailed probabilities for the normal distribution are presented
in Table 8.1. Note that the second and third columns of Table 8.1 report the
probability in the tails, $P(Z > k)$ and $P(|Z| > k)$, so the coverage probability
in Equation 8.5 is one minus the two-tailed value. Taking a fairly standard
example, suppose that we want to choose a $k$ such that the one-tailed
probability is 0.025, or equivalently that we want to determine the value of $k$
such that the probability is 0.05 that the true value of $\theta$ will lie outside the
range. The value of $k$ for this choice is 1.96. This example is comparable to the
standard introductory example of a 0.95 confidence level.
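The entries of Table 8.1 can be reproduced numerically. The following is a minimal sketch that assumes Python's standard library `statistics.NormalDist`; the function names are ours and purely illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def tail_probabilities(k):
    """Return (P(Z > k), P(|Z| > k)) for a standard normal Z."""
    one_tail = 1.0 - STD_NORMAL.cdf(k)
    return one_tail, 2.0 * one_tail


def critical_value(two_tail):
    """Return k such that P(|Z| > k) equals two_tail."""
    return STD_NORMAL.inv_cdf(1.0 - two_tail / 2.0)


if __name__ == "__main__":
    # The k values are the ones listed in Table 8.1.
    for k in (1.0, 1.5, 1.6449, 1.75, 1.96, 2.0, 2.3263):
        one_tail, two_tail = tail_probabilities(k)
        print(f"{k:7.4f} {one_tail:7.4f} {two_tail:7.4f}")
    print(round(critical_value(0.05), 4))
```

Working from a target probability to $k$ (the second function) is the direction typically used when constructing a confidence interval.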


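The values of $\gamma_k$ can be derived from the standard normal table as
$$
P\left[ \frac{|T - \theta|}{\sqrt{\dfrac{\theta (1 - \theta)}{N}}} < k \right] = \gamma_k. \tag{8.6}
$$
Assuming that the sample value of $T$ is $t$, the confidence interval ($C[\cdot]$) is
defined by
$$
C\left[ \frac{|t - \theta|}{\sqrt{\dfrac{\theta (1 - \theta)}{N}}} < k \right] = \gamma_k. \tag{8.7}
$$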


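Building on the first term,
$$
\begin{aligned}
P\left[ \frac{|t - \theta|}{\sqrt{\dfrac{\theta (1 - \theta)}{N}}} < k \right]
&= P\left[ |t - \theta| < k \sqrt{\frac{\theta (1 - \theta)}{N}} \right] \\
&= P\left[ (t - \theta)^2 < k^2 \frac{\theta (1 - \theta)}{N} \right] \\
&= P\left[ t^2 - 2 t \theta + \theta^2 - \frac{k^2}{N} \theta + \frac{k^2}{N} \theta^2 < 0 \right] \\
&= P\left[ \theta^2 \left( 1 + \frac{k^2}{N} \right) - \theta \left( 2 t + \frac{k^2}{N} \right) + t^2 < 0 \right] = \gamma_k.
\end{aligned} \tag{8.8}
$$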


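Using this probability, it is possible to define two numbers $h_1(t)$ and $h_2(t)$
for which this inequality holds. Mathematically, applying the quadratic
formula,
$$
P\left[ h_1(t) < \theta < h_2(t) \right] = \gamma_k \quad \text{where}
$$
$$
h_1(t), h_2(t) = \frac{2 t + \dfrac{k^2}{N} \pm \sqrt{\left( 2 t + \dfrac{k^2}{N} \right)^2 - 4 \left( 1 + \dfrac{k^2}{N} \right) t^2}}{2 \left( 1 + \dfrac{k^2}{N} \right)}. \tag{8.9}
$$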
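As a rough numerical illustration of Equation 8.9, the two roots can be computed directly. This is only a sketch: the function name is ours, and the values $t = 0.38$, $N = 50$, and $k = 1.96$ are illustrative assumptions.

```python
import math


def quadratic_interval(t, n, k):
    """Roots of theta^2 (1 + k^2/n) - theta (2 t + k^2/n) + t^2 = 0 (Equation 8.9)."""
    a = 1.0 + k**2 / n
    b = 2.0 * t + k**2 / n
    root = math.sqrt(b**2 - 4.0 * a * t**2)  # positive whenever 0 < t < 1
    return (b - root) / (2.0 * a), (b + root) / (2.0 * a)


if __name__ == "__main__":
    # Illustrative values (assumptions): t = 0.38 from N = 50 trials, k = 1.96.
    lower, upper = quadratic_interval(t=0.38, n=50, k=1.96)
    print(f"({lower:.4f}, {upper:.4f})")
```

Unlike an interval that is symmetric around $t$, both roots necessarily fall inside $(0, 1)$ when $0 < t < 1$, because they solve the inequality in $\theta$ exactly rather than replacing $\theta$ with $t$ in the variance.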


In order to more fully develop the concept of the confidence interval,
consider the sample estimates for four draws of two different Bernoulli
distributions presented in Table 8.2. The population distribution for the first
four columns is for $\theta = 0.40$ while the population distribution for the second
four columns holds $\theta = 0.80$. Further, the samples are nested in that the sample
of 100 for draw 1 includes the sample of 50 for draw 1. Essentially, each column
represents an empirical limiting process.

Starting with draw 1 such that $\theta = 0.40$, the sample value for $t$ is 0.3800.
In this discussion, we are interested in constructing an interval that contains
the true value of the parameter with some degree of confidence. The question
is, what are our alternatives? First, we could use the overly simplistic version
of the standard normal to conclude that
$$
\theta \in (t - 1.96 \times S_t, \; t + 1.96 \times S_t) \tag{8.10}
$$
where 1.96 corresponds to a “two-sided” confidence region under normality.
Obviously the range in Equation 8.10 is much too broad (i.e., the range
includes values outside legitimate values of $\theta$). Why is this the case? The
confidence interval implicitly assumes that $t$ is normally distributed. Next, if we
use the estimate of the variance associated with the Bernoulli distribution in
Equation 8.4, we have
$$
\theta \in \left( t - 1.96 \sqrt{\frac{t (1 - t)}{N}}, \; t + 1.96 \sqrt{\frac{t (1 - t)}{N}} \right) = (0.2455, 0.5145) \tag{8.11}
$$
for $N = 50$ of draw 1. This interval includes the true value of $\theta$, but we
would not know that in an application. Next, consider what happens to the
confidence interval as we increase the number of draws to $N = 100$.
$$
\theta \in (0.2472, 0.4328). \tag{8.12}
$$
Notice that the confidence region is somewhat smaller and still contains the
true value of $\theta$.
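The intervals in Equations 8.11 and 8.12 can be checked with a few lines of code. The sketch below is hedged in two ways: the function name is ours, and the value $t = 0.34$ for $N = 100$ is inferred from the midpoint of the interval in Equation 8.12 rather than read from Table 8.2.

```python
import math


def normal_approx_interval(t, n, k=1.96):
    """Interval t -/+ k * sqrt(t (1 - t) / n), as in Equation 8.11."""
    half_width = k * math.sqrt(t * (1.0 - t) / n)
    return t - half_width, t + half_width


if __name__ == "__main__":
    # t = 0.38 is stated in the text; t = 0.34 is an inferred assumption.
    for t, n in ((0.38, 50), (0.34, 100)):
        lower, upper = normal_approx_interval(t, n)
        print(f"N = {n:3d}: ({lower:.4f}, {upper:.4f})")
```

The half-width shrinks at the rate $1/\sqrt{N}$, which is why doubling the sample narrows the interval by less than half.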

Next, we consider the confidence interval computed from the results in
Equation 8.9 for the same distributions presented in Table 8.3. In this case
we assume $k = 0.95$. Note that this is not the $k = 1.96$ used in Equations 8.11
and 8.12: from Equation 8.5, $k = 0.95$ corresponds to a coverage probability of
roughly 0.66, so the two sets of intervals are not directly comparable. The
linear term in Table 8.3 is computed as
$$
\frac{2 t + \dfrac{k^2}{N}}{2 \left( 1 + \dfrac{k^2}{N} \right)} \tag{8.13}
$$
while the square root involves the
$$
\frac{\sqrt{\left( 2 t + \dfrac{k^2}{N} \right)^2 - 4 \left( 1 + \dfrac{k^2}{N} \right) t^2}}{2 \left( 1 + \dfrac{k^2}{N} \right)} \tag{8.14}
$$
term. The lower and upper bounds are then computed by subtracting
Equation 8.14 from, and adding it to, Equation 8.13. In the case of $N = 50$, the
confidence interval is (0.3175, 0.4468) while for $N = 100$, the confidence
interval is (0.2966, 0.3863).

It is clear that the values of the confidence intervals are somewhat
different. In practice, the first approach, based on the limiting distribution of
the maximum likelihood formulation, is probably more typical.

Next, consider the confidence interval for the mean of a normally
distributed random variable where the variance is known.


Example 8.2. Let $X_i \sim N(\mu, \sigma^2)$, $i = 1, 2, \ldots, n$ where $\mu$ is unknown and $\sigma^2$
is known. We have
$$
T = \bar{X} \sim N\left( \mu, \frac{\sigma^2}{n} \right). \tag{8.15}
$$
Define
$$
P\left[ \frac{|T - \mu|}{\sqrt{\dfrac{\sigma^2}{n}}} < k \right] = \gamma_k. \tag{8.16}
$$
Example 8.2 is neat and tidy, but unrealistic. If we do not know the mean 
of the distribution, then it is unlikely that we will know the variance. Hence, 
we need to modify the confidence interval in Example 8.2 by introducing the 
Student’s t-distribution. 


Example 8.3. Suppose that $X_i \sim N(\mu, \sigma^2)$, $i = 1, 2, \ldots, n$ with both $\mu$ and
$\sigma^2$ unknown. Let
$$
T = \bar{X} \tag{8.17}
$$
be an estimator of $\mu$ and
$$
S^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \tag{8.18}
$$
be the estimator of $\sigma^2$. Then the probability distribution is
$$
t_{n-1} = S^{-1} (T - \mu) \sqrt{n - 1}. \tag{8.19}
$$
This distribution is known as the Student’s t-distribution with $n - 1$ degrees of
freedom.


Critical to our understanding of the Student’s t-distribution is the amount of
information in the sample. To develop this, consider a simple two observation
sample (with the divisor $n - 1 = 1$)
$$
\begin{aligned}
S^2 &= (X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 \\
&= \left( X_1 - \frac{X_1 + X_2}{2} \right)^2 + \left( X_2 - \frac{X_1 + X_2}{2} \right)^2 \\
&= \left( \frac{X_1}{2} - \frac{X_2}{2} \right)^2 + \left( \frac{X_2}{2} - \frac{X_1}{2} \right)^2 \\
&= \frac{1}{2} (X_1 - X_2)^2.
\end{aligned} \tag{8.20}
$$
Thus, two observations on $X_i$ only give us one observation on the variance
after we account for the mean; that is, two observations only give us one degree
of freedom on the sample variance. Theorem 8.4 develops the concept in a
slightly more rigorous fashion.

Theorem 8.4. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $N(\mu, \sigma^2)$
distribution, and let
$$
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{and} \quad S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2. \tag{8.21}
$$
Then

a) $\bar{X}$ and $S^2$ are independent random variables.

b) $\bar{X} \sim N(\mu, \sigma^2 / n)$.

c) $(n - 1) S^2 / \sigma^2$ has a chi-squared distribution with $n - 1$ degrees of freedom.
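The three claims of Theorem 8.4 can be examined by simulation. The following sketch uses only the Python standard library; the sample size, parameter values, seed, and function names are arbitrary assumptions made for illustration. It relies on the standard facts that a chi-squared variable with $n - 1$ degrees of freedom has mean $n - 1$ and variance $2(n - 1)$, and it checks zero correlation, which is implied by (but weaker than) independence.

```python
import random
import statistics


def simulate(n=5, mu=2.0, sigma=3.0, replications=20000, seed=12345):
    """Return sample means and scaled variances (n - 1) S^2 / sigma^2."""
    rng = random.Random(seed)
    means, scaled_vars = [], []
    for _ in range(replications):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
        # statistics.variance divides by n - 1, matching S^2 in Equation 8.21.
        scaled_vars.append((n - 1) * statistics.variance(sample) / sigma**2)
    return means, scaled_vars


def correlation(x, y):
    """Sample correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))


if __name__ == "__main__":
    means, scaled_vars = simulate()
    print("variance of sample mean:", statistics.variance(means))  # theory: 9 / 5
    print("mean of scaled variance:", statistics.fmean(scaled_vars))  # theory: 4
    print("variance of scaled variance:", statistics.variance(scaled_vars))  # theory: 8
    print("correlation:", correlation(means, scaled_vars))  # theory: 0
```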


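The proof of independence is based on the fact that $S^2$ is a function of
the deviations from the mean which, for normally distributed observations, are
independent of the mean. More interesting is the discussion of the chi-squared
statistic. The chi-squared distribution with $p$ degrees of freedom is defined as
$$
f(x) = \frac{1}{\Gamma(p/2) \, 2^{p/2}} x^{p/2 - 1} e^{-x/2}. \tag{8.22}
$$
In general, the gamma distribution is defined through the gamma function
$$
\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t} \, dt. \tag{8.23}
$$
Dividing both sides of the expression by $\Gamma(\alpha)$ yields
$$
1 = \frac{1}{\Gamma(\alpha)} \int_0^{\infty} t^{\alpha - 1} e^{-t} \, dt \;\Rightarrow\; f(t) = \frac{t^{\alpha - 1} e^{-t}}{\Gamma(\alpha)}. \tag{8.24}
$$
Substituting $X = \beta t$ gives the traditional two parameter form of the
distribution function
$$
f(x | \alpha, \beta) = \frac{1}{\Gamma(\alpha) \beta^{\alpha}} x^{\alpha - 1} e^{-x / \beta}. \tag{8.25}
$$
The chi-squared distribution in Equation 8.22 is the special case $\alpha = p/2$ and
$\beta = 2$. The expected value of the gamma distribution is $\alpha \beta$ and the variance
is $\alpha \beta^2$.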
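The stated mean and variance of the gamma distribution can be verified numerically. The sketch below approximates the moments with a simple midpoint rule; the function names, the parameter values $\alpha = 2.5$ and $\beta = 2$, and the integration settings are illustrative assumptions. It also compares the gamma density with the chi-squared density of Equation 8.22 for $p = 2\alpha = 5$.

```python
import math


def gamma_pdf(x, alpha, beta):
    """Two-parameter gamma density of Equation 8.25 (beta acts as a scale)."""
    return x ** (alpha - 1.0) * math.exp(-x / beta) / (math.gamma(alpha) * beta**alpha)


def chi2_pdf(x, p):
    """Chi-squared density of Equation 8.22 with p degrees of freedom."""
    return x ** (p / 2.0 - 1.0) * math.exp(-x / 2.0) / (math.gamma(p / 2.0) * 2.0 ** (p / 2.0))


def moment(pdf, order, upper=100.0, steps=100000):
    """Midpoint-rule approximation of the integral of x^order * pdf(x) on (0, upper)."""
    h = upper / steps
    return sum(((i + 0.5) * h) ** order * pdf((i + 0.5) * h) for i in range(steps)) * h


if __name__ == "__main__":
    alpha, beta = 2.5, 2.0  # assumed values; equivalent to chi-squared with p = 5
    mean = moment(lambda x: gamma_pdf(x, alpha, beta), 1)
    second = moment(lambda x: gamma_pdf(x, alpha, beta), 2)
    print("mean:", mean, "theory:", alpha * beta)
    print("variance:", second - mean**2, "theory:", alpha * beta**2)
    print(gamma_pdf(3.0, alpha, beta), chi2_pdf(3.0, 5))
```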


Lemma 8.5 (Facts about chi-squared random variables). We use the notation
$\chi^2_p$ to denote a chi-squared random variable with $p$ degrees of freedom.


• If $Z$ is a $N(0, 1)$ random variable, then $Z^2 \sim \chi^2_1$, that is, the square of a
standard normal random variable is a chi-squared random variable.


• If $X_1, X_2, \ldots, X_n$ are independent, and $X_i \sim \chi^2_{p_i}$, then
$X_1 + X_2 + \cdots + X_n \sim \chi^2_{p_1 + p_2 + \cdots + p_n}$, that is, independent
chi-squared variables add to a chi-squared variable, and the degrees of freedom
also add.

The first part of Lemma 8.5 follows from the transformation of random
variables for $Y = X^2$, which yields
$$
f_Y(y) = \frac{1}{2 \sqrt{y}} \left[ f_X(\sqrt{y}) + f_X(-\sqrt{y}) \right] = \frac{1}{\sqrt{2 \pi}} y^{-1/2} e^{-y/2}. \tag{8.26}
$$


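Returning to the proof at hand, we want to show that $(n - 1) S^2 / \sigma^2$ has a
chi-squared distribution with $n - 1$ degrees of freedom. To demonstrate this,
note that
$$
\begin{aligned}
S_n^2 &= \frac{1}{n - 1} \sum_{i=1}^{n} \left[ X_i - \frac{1}{n} \sum_{j=1}^{n} X_j \right]^2 \\
(n - 1) S_n^2 &= \left( X_n - \frac{1}{n} \sum_{j=1}^{n-1} X_j - \frac{1}{n} X_n \right)^2 + \sum_{i=1}^{n-1} \left( X_i - \frac{1}{n} \sum_{j=1}^{n-1} X_j - \frac{1}{n} X_n \right)^2 \\
(n - 1) S_n^2 &= \left( \frac{n - 1}{n} X_n - \frac{n - 1}{n} \bar{X}_{n-1} \right)^2 + \sum_{i=1}^{n-1} \left( X_i - \frac{1}{n} \sum_{j=1}^{n-1} X_j - \frac{1}{n} X_n \right)^2
\end{aligned} \tag{8.27}
$$
where $\bar{X}_{n-1}$ denotes the mean of the first $n - 1$ observations. If $n = 2$, we get
$$
S_2^2 = \frac{1}{2} (X_2 - X_1)^2. \tag{8.28}
$$
Taking $\sigma^2 = 1$ without loss of generality, $(X_2 - X_1) / \sqrt{2}$ is distributed
$N(0, 1)$, so $S_2^2 \sim \chi^2_1$, and by extension for $n = k$, $(k - 1) S_k^2 \sim \chi^2_{k-1}$.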

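Given these results for the chi-squared, the distribution of the Student’s t
then follows.
$$
t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{(\bar{X} - \mu) / (\sigma / \sqrt{n})}{\sqrt{S^2 / \sigma^2}}. \tag{8.29}
$$
Note that this creates a standard normal random variable in the numerator
and a random variable distributed
$$
\sqrt{\frac{\chi^2_{n-1}}{n - 1}} \tag{8.30}
$$
in the denominator. Multiplying the standard normal density by the
chi-squared density and by the Jacobian of the transformation, and then
integrating out the chi-squared variable, yields the density of the Student’s
t-distribution with $p = n - 1$ degrees of freedom
$$
f_T(t) = \frac{\Gamma\left( \frac{p + 1}{2} \right)}{\Gamma\left( \frac{p}{2} \right)} \frac{1}{(p \pi)^{1/2}} \frac{1}{\left( 1 + t^2 / p \right)^{(p + 1)/2}}. \tag{8.31}
$$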
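To make the Student’s t-distribution operational without statistical tables, the standard t density can be integrated numerically and inverted by bisection. This is a minimal sketch, not a production routine: the function names, the Simpson's rule settings, and the data in the usage example are all assumptions made for illustration.

```python
import math
import statistics


def t_pdf(t, p):
    """Student's t density with p degrees of freedom."""
    log_const = math.lgamma((p + 1) / 2.0) - math.lgamma(p / 2.0) - 0.5 * math.log(p * math.pi)
    return math.exp(log_const - (p + 1) / 2.0 * math.log1p(t * t / p))


def t_cdf(t, p, steps=2000):
    """P(T <= t) by Simpson's rule on [0, |t|] plus symmetry (steps must be even)."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, p) + t_pdf(a, p)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * t_pdf(i * h, p)
    area = total * h / 3.0
    return 0.5 + area if t > 0 else 0.5 - area


def t_critical(prob, p):
    """Value c with P(T <= c) = prob, for prob > 0.5, found by bisection."""
    low, high = 0.0, 100.0
    for _ in range(60):
        mid = 0.5 * (low + high)
        if t_cdf(mid, p) < prob:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


def t_interval(sample, level=0.95):
    """Two-sided interval for the mean when the variance is estimated."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # stdev uses the n - 1 divisor
    c = t_critical(0.5 + level / 2.0, n - 1)
    return mean - c * std_err, mean + c * std_err


if __name__ == "__main__":
    heights = [5.6, 6.1, 5.9, 5.7, 6.3, 5.8]  # hypothetical data
    print(t_interval(heights))
```

Because the critical value from the t-distribution exceeds the corresponding normal value for any finite sample, the resulting interval is wider than the one based on a known variance.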


8.2 Bayesian Estimation 


Implicitly in our previous discussions about estimation, we adopted a clas- 
sical viewpoint. We had some process generating random observations. This 
random process was a function of fixed, but unknown parameters. We then 
designed procedures to estimate these unknown parameters based on observed 
data. Specifically, if we assumed that a random process such as students ad- 
mitted to the University of Florida generated heights, then this height process 
can be characterized by a normal distribution. We can estimate the parameters 
of this distribution using maximum likelihood. The likelihood of a particular 
sample can be expressed as 


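$$
L(X_1, X_2, \ldots, X_n | \mu, \sigma^2) = (2 \pi)^{-n/2} (\sigma^2)^{-n/2} \exp\left[ -\frac{1}{2 \sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2 \right]. \tag{8.32}
$$

Our estimates of $\mu$ and $\sigma^2$ are then based on the value of each parameter that
maximizes the likelihood of drawing that sample.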

Turning this process around slightly, Bayesian analysis assumes that we 
can make some kind of probability statement about parameters before we 
start. The sample is then used to update our prior distribution. First, assume 
that our prior beliefs about the distribution function can be expressed as a 
probability density function $\pi(\theta)$ where $\theta$ is the parameter we are interested
in estimating. Based on a sample (the likelihood function), we can update our
knowledge of the distribution using Bayes' rule.
$$
\pi(\theta | X) = \frac{L(X | \theta) \pi(\theta)}{\int_{-\infty}^{\infty} L(X | \theta) \pi(\theta) \, d\theta}. \tag{8.33}
$$


To develop this concept, assume that we want to estimate the probability of
a Bernoulli event ($p$) such as a coin toss. The standard probability is then
$$
P[x] = p^x (1 - p)^{1 - x}. \tag{8.34}
$$
However, instead of estimating this probability using the sample mean, we
use a Bayesian approach. Our prior is that $p$ in the Bernoulli distribution is
distributed $B(\alpha, \beta)$.

The beta distribution is defined similarly to the gamma distribution.
$$
f(p | \alpha, \beta) = \frac{1}{B(\alpha, \beta)} p^{\alpha - 1} (1 - p)^{\beta - 1}. \tag{8.35}
$$
$B(\alpha, \beta)$ is defined as
$$
B(\alpha, \beta) = \int_0^1 x^{\alpha - 1} (1 - x)^{\beta - 1} \, dx = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}. \tag{8.36}
$$
Thus, the beta distribution is defined as
$$
f(p | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} p^{\alpha - 1} (1 - p)^{\beta - 1}. \tag{8.37}
$$


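Assume that we are interested in forming the posterior distribution after
a single draw.
$$
\begin{aligned}
\pi(p | X) &= \frac{p^X (1 - p)^{1 - X} \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} p^{\alpha - 1} (1 - p)^{\beta - 1}}{\displaystyle \int_0^1 p^X (1 - p)^{1 - X} \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} p^{\alpha - 1} (1 - p)^{\beta - 1} \, dp} \\
&= \frac{p^{X + \alpha - 1} (1 - p)^{\beta - X}}{\displaystyle \int_0^1 p^{X + \alpha - 1} (1 - p)^{\beta - X} \, dp}.
\end{aligned} \tag{8.38}
$$
Following the original specification of the beta function,
$$
\begin{aligned}
\int_0^1 p^{X + \alpha - 1} (1 - p)^{\beta - X} \, dp &= \int_0^1 p^{\alpha^* - 1} (1 - p)^{\beta^* - 1} \, dp \\
&\quad \text{where } \alpha^* = X + \alpha \text{ and } \beta^* = \beta - X + 1 \\
&= \frac{\Gamma(X + \alpha) \Gamma(\beta - X + 1)}{\Gamma(\alpha + \beta + 1)}.
\end{aligned} \tag{8.39}
$$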

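The posterior distribution (the distribution of $p$ after the value of the
observation is known) is then
$$
\pi(p | X) = \frac{\Gamma(\alpha + \beta + 1)}{\Gamma(X + \alpha) \Gamma(\beta - X + 1)} p^{X + \alpha - 1} (1 - p)^{\beta - X}. \tag{8.40}
$$
The Bayesian estimate of $p$ is then the value that minimizes a loss function.
Several loss functions can be used, but we will focus on the quadratic loss
function consistent with the mean squared error.
$$
\min_{\hat{p}} E\left[ (\hat{p} - p)^2 \right] \;\Rightarrow\; \frac{\partial E\left[ (\hat{p} - p)^2 \right]}{\partial \hat{p}} = 2 E[\hat{p} - p] = 0 \;\Rightarrow\; \hat{p} = E[p]. \tag{8.41}
$$
Taking the expectation of the posterior distribution yields
$$
\begin{aligned}
E[p] &= \int_0^1 \frac{\Gamma(\alpha + \beta + 1)}{\Gamma(X + \alpha) \Gamma(\beta - X + 1)} p^{X + \alpha} (1 - p)^{\beta - X} \, dp \\
&= \frac{\Gamma(\alpha + \beta + 1)}{\Gamma(X + \alpha) \Gamma(\beta - X + 1)} \int_0^1 p^{X + \alpha} (1 - p)^{\beta - X} \, dp.
\end{aligned} \tag{8.42}
$$
As before, we solve the integral by creating $\alpha^* = \alpha + X + 1$ and
$\beta^* = \beta - X + 1$. The integral then becomes
$$
\begin{aligned}
\int_0^1 p^{\alpha^* - 1} (1 - p)^{\beta^* - 1} \, dp &= \frac{\Gamma(\alpha^*) \Gamma(\beta^*)}{\Gamma(\alpha^* + \beta^*)} \\
&= \frac{\Gamma(\alpha + X + 1) \Gamma(\beta - X + 1)}{\Gamma(\alpha + \beta + 2)}.
\end{aligned} \tag{8.43}
$$
Hence,
$$
E[p] = \frac{\Gamma(\alpha + \beta + 1)}{\Gamma(\alpha + \beta + 2)} \frac{\Gamma(\alpha + X + 1) \Gamma(\beta - X + 1)}{\Gamma(\alpha + X) \Gamma(\beta - X + 1)} \tag{8.44}
$$
which can be simplified using the fact
$$
\Gamma(\alpha + 1) = \alpha \Gamma(\alpha). \tag{8.45}
$$
Therefore
$$
\begin{aligned}
\frac{\Gamma(\alpha + \beta + 1)}{\Gamma(\alpha + \beta + 2)} \frac{\Gamma(\alpha + X + 1)}{\Gamma(\alpha + X)} &= \frac{\Gamma(\alpha + \beta + 1)}{(\alpha + \beta + 1) \Gamma(\alpha + \beta + 1)} \frac{(\alpha + X) \Gamma(\alpha + X)}{\Gamma(\alpha + X)} \\
&= \frac{\alpha + X}{\alpha + \beta + 1}.
\end{aligned} \tag{8.46}
$$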


To make this estimation process operational, assume that we have a prior
distribution with parameters $\alpha = \beta = 1.4968$ that yields a beta distribution
with a mean $p$ of 0.5 and a variance of the estimate of 0.0625. Next assume
that we flip a coin and it comes up heads ($X = 1$). The new estimate of $p$
becomes 0.6252. If, on the other hand, the outcome is a tail ($X = 0$), the new
estimate of $p$ is 0.3748.

Extending the results to $n$ Bernoulli trials yields
$$
\pi(p | X) = \frac{\Gamma(\alpha + \beta + n)}{\Gamma(\alpha + Y) \Gamma(\beta - Y + n)} p^{Y + \alpha - 1} (1 - p)^{\beta - Y + n - 1} \tag{8.47}
$$
where $Y$ is the sum of individual $X$s or the number of heads in the sample.
The estimated value of $p$ then becomes
$$
\hat{p} = \frac{Y + \alpha}{\alpha + \beta + n}. \tag{8.48}
$$
If the first draw is $Y = 15$ and $n = 50$, then the estimated value of $p$ is
0.3113. This value compares with the maximum likelihood estimate of 0.3000.
Since the maximum likelihood estimator in this case is unbiased, the results
imply that the Bayesian estimator is biased.
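The posterior means discussed above follow directly from Equation 8.48. The sketch below evaluates them for the single coin toss and for the sample of fifty; the function names are ours, and the prior variance uses the standard formula for the variance of a beta distribution.

```python
def posterior_mean(successes, n, alpha, beta):
    """Mean of the Beta(alpha + Y, beta + n - Y) posterior (Equation 8.48)."""
    return (successes + alpha) / (alpha + beta + n)


def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    return alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))


if __name__ == "__main__":
    prior = 1.4968  # alpha = beta, as in the text
    print("prior variance:", round(beta_variance(prior, prior), 4))
    print("one head:", round(posterior_mean(1, 1, prior, prior), 4))
    print("one tail:", round(posterior_mean(0, 1, prior, prior), 4))
    print("15 heads in 50:", round(posterior_mean(15, 50, prior, prior), 4))
    print("maximum likelihood:", 15 / 50)
```

The posterior mean is a weighted average of the prior mean (0.5) and the sample proportion, so it is pulled toward 0.5; the pull weakens as $n$ grows relative to $\alpha + \beta$.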


8.3 Bayesian Confidence Intervals


Apart from providing an alternative procedure for estimation, the Bayesian
approach provides a direct procedure for the formulation of parameter
confidence intervals. Returning to the simple case of a single coin toss, the
probability density function of the estimator becomes
$$
f(p | X) = \frac{\Gamma(\alpha + \beta + 1)}{\Gamma(X + \alpha) \Gamma(\beta - X + 1)} p^{X + \alpha - 1} (1 - p)^{\beta - X}. \tag{8.49}
$$
As previously discussed, we know that given $\alpha = \beta = 1.4968$ and a head, the
Bayesian estimator of $p$ is 0.6252. However, using the posterior distribution
function, we can also compute the probability that the value of $p$ is less than
0.5 given a head.
$$
\begin{aligned}
P[p < 0.5] &= \int_0^{0.5} \frac{\Gamma(\alpha + \beta + 1)}{\Gamma(X + \alpha) \Gamma(\beta - X + 1)} p^{X + \alpha - 1} (1 - p)^{\beta - X} \, dp \\
&= 0.2876.
\end{aligned} \tag{8.50}
$$


Hence, we have a very formal statement of confidence intervals. 
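The probability in Equation 8.50 has no simple closed form, but it can be approximated by numerical integration of the posterior density. This sketch uses a midpoint rule from the standard library only; the function names and the number of integration steps are assumptions, and the printed value can be compared with Equation 8.50.

```python
import math


def beta_pdf(p, a, b):
    """Beta(a, b) density evaluated at 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(p) + (b - 1.0) * math.log1p(-p))


def beta_cdf(x, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of the beta density on (0, x)."""
    h = x / steps
    return sum(beta_pdf((i + 0.5) * h, a, b) for i in range(steps)) * h


if __name__ == "__main__":
    alpha = beta = 1.4968
    heads = 1  # a single toss that came up heads
    # Posterior parameters implied by Equation 8.49.
    a_post, b_post = heads + alpha, beta - heads + 1.0
    print("P[p < 0.5 | head] =", round(beta_cdf(0.5, a_post, b_post), 4))
```

The same function gives any other posterior probability statement, for example the probability that $p$ lies in a chosen interval, by differencing two evaluations.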


8.4 Chapter Summary 


• Interval estimation involves the estimation of a range of parameter values
that contains the true population value. This range is typically referred to
as the confidence interval.


• The Student’s t-distribution is based on the fact that the variance
coefficient used for a sample of normal random variables is estimated. If the
variance parameter is known, then the confidence interval can be
constructed using the normal distribution.


• The posterior distribution of the Bayesian estimator allows for a direct
construction of the confidence interval based on the data.


TABLE 8.4
Data for Exercise 8-1E

Obs.   $x_i$    Obs.   $x_i$    Obs.   $x_i$
   1     1      11     1      21     1
   2     0      12     0      22     1
   3     1      13     0      23     1
   4     1      14     1      24     1
   5     0      15     0      25     1
   6     1      16     1      26     1
   7     1      17     1      27     0
   8     1      18     1      28     1
   9     1      19     1      29     1
  10     0      20     1      30     1


8.5 Review Questions

8-1R. Demonstrate that
$$
S^2 = \sum_{i=1}^{3} (x_i - \bar{x})^2 \tag{8.51}
$$
can be written as
$$
S^2 = (x_1 - x_2)^2 + (x_2 - x_3)^2. \tag{8.52}
$$

8-2R. Discuss the implications of 8-1R for the term degrees of freedom.

8-3R. Construct the posterior distribution for the parameter $\theta$ from the
Bernoulli distribution based on a prior of $U[0, 1]$. Assume that $T$
observations out of $N$ are positive. (Hint, use the definition of the beta
distribution.)


8.6 Numerical Exercises

8-1E. Compute the confidence interval for the $\theta$ parameter of the Bernoulli
distribution given in Table 8.4 using the maximum likelihood estimate
of the standard deviation of the parameter.

8-E2. Construct the confidence interval for the mean of each sample using
the data in Table 8.5.

8-E3. Construct the posterior distribution from review question 8-3R using
the first ten observations in Table 8.4.


TABLE 8.5
Normal Random Variables for Exercise 8-E2

               Sample
Obs.        1        2        3
   1   -2.231   -4.259    1.614
   2   -1.290   -7.846    1.867
   3   -0.317   -0.131    3.001
   4   -1.509   -5.188    0.174
   5   -1.324   -5.387    3.009
   6   -0.396    4.795   -0.148
   7   -2.048   -0.224   -0.110
   8   -3.089    4.389    2.030
   9   -0.717   -0.380    2.549
  10   -2.311   -1.008    0.413
  11    3.686   -0.464   -1.384
  12   -1.985    1.577    2.313
  13    2.153   -8.507   -3.697
  14   -1.205   -6.396    3.075
  15   -3.798    8.004   -1.167
  16   -1.063    1.526   -0.897
  17   -0.593   -2.890    0.589
  18    0.213    2.600    2.357
  19    0.175    8.166   -0.005
  20   -1.804   -1.880    3.101
  21   -1.566    0.266   -1.223
  22    1.953    1.814    1.936
  23    1.045    4.248   -2.907
  24    2.677    2.316    0.622
  25   -6.429    3.454    2.658


In general, there are two kinds of hypotheses: one type concerns the form of
the probability distribution (e.g., is the random variable normally distributed?)
and the second concerns parameters of a distribution function (e.g., what is
the mean of a distribution?).

The second kind of hypothesis is the traditional stuff of econometrics. We
may be interested in testing whether the effect of income on consumption is
greater than one, or whether the effect of price on the level consumed is equal
to zero. The second of these examples is termed a simple hypothesis. Under
this scenario, we test the value of a parameter against a single alternative.
The first example (whether the effect of income on consumption is
greater than one) is termed a composite hypothesis. Implicit in this test are
several alternative values.

Hypothesis testing involves the comparison between two competing
hypotheses, or conjectures. The null hypothesis, denoted $H_0$, is sometimes
referred to as the maintained hypothesis. The competing hypothesis to be
accepted if the null hypothesis is rejected is called the alternative hypothesis.

The general notion of the hypothesis test is that we collect a sample of
data $X_1, \ldots, X_n$. This sample is a multivariate random variable, which can be
regarded as an element of a Euclidean space $E_n$. If the multivariate random
variable is contained in space $R$, we reject the null hypothesis. Alternatively, if
the random variable is in the complement of the space $R$, we fail to reject the
null hypothesis. Mathematically,
$$
\begin{aligned}
&\text{if } X \in R \text{ then reject } H_0; \\
&\text{if } X \notin R \text{ then fail to reject } H_0.
\end{aligned} \tag{9.1}
$$
The set $R$ is called the region of rejection or the critical region of the test.
In order to determine whether the sample is in a critical region, we
construct a test statistic $T(X)$. Note that, like any other statistic, $T(X)$ is a
random variable. The hypothesis test given this statistic can then be written
as
$$
\begin{aligned}
&T(X) \in R \Rightarrow \text{reject } H_0; \\
&T(X) \notin R \Rightarrow \text{fail to reject } H_0.
\end{aligned} \tag{9.2}
$$
A statistic used to test hypotheses is called a test statistic.


Definition 9.1. A hypothesis is called simple if it specifies the values of all 
the parameters of a probability distribution. Otherwise, it is called composite. 


As an example, consider constructing a standard $t$ test for the hypothesis
$H_0: \mu = 0$ against the hypothesis $H_1: \mu = 2$. To do this we will compute the
$t$ statistic for a sample (say 20 observations from the potential population).
Given this scenario, we define a critical value of 1.729 (that is, the one-sided
Student’s t value for 19 degrees of freedom at a 0.95 level of confidence). This
test is depicted graphically in Figure 9.1. If the sample value of the $t$ statistic is
greater than $t^* = 1.729$, we reject $H_0$. Technically, $t \in R$, so we reject $H_0$.


[Figure: density $f(t)$ plotted against $t$; image not reproduced]

FIGURE 9.1
Type I and Type II Error.


9.1 Type I and Type II Errors


Whenever we develop a statistical test, there are two potential errors: the
error of rejecting a hypothesis (typically the null hypothesis or the hypothesis
of no effect) when it is true versus the error of failing to reject a hypothesis
when it is false. These errors represent a tradeoff. We can make the first error
essentially zero by increasing the critical value (i.e., the level of $t^*$
in Figure 9.1). However, increasing this critical value implies an increase in
the second error, the probability of failing to reject the hypothesis when it
is indeed false.


Definition 9.2. A Type I error is the error of rejecting $H_0$ when it is true.
A Type II error is the error of accepting $H_0$ when it is false (that is, when $H_1$
is true).


We denote the probability of Type I error as $\alpha$ and the probability of Type
II error as $\beta$. Mathematically,
$$
\begin{aligned}
\alpha &= P[X \in R | H_0] \\
\beta &= P[X \in \bar{R} | H_1].
\end{aligned} \tag{9.3}
$$
The probability of a Type I error is also called the size of a test.

Assume that we want to compare two critical regions $R_1$ and $R_2$. Assume
that we choose either critical region $R_1$ or $R_2$ randomly with probabilities
$\delta$ and $1 - \delta$, respectively. This is called a randomized test. If the probabilities
of the two types of errors for $R_1$ and $R_2$ are $(\alpha_1, \beta_1)$ and $(\alpha_2, \beta_2)$, respectively,
the probability of each type of error becomes
$$
\begin{aligned}
\alpha &= \delta \alpha_1 + (1 - \delta) \alpha_2 \\
\beta &= \delta \beta_1 + (1 - \delta) \beta_2.
\end{aligned} \tag{9.4}
$$
The values $(\alpha, \beta)$ are called the characteristics of the test.


Definition 9.3. Let $(\alpha_1, \beta_1)$ and $(\alpha_2, \beta_2)$ be the characteristics of two tests.
The first test is better (or more powerful) than the second test if $\alpha_1 \leq \alpha_2$
and $\beta_1 \leq \beta_2$, with a strict inequality holding for at least one of the two.


If we cannot determine that one test is better by the definition, we could 
consider the relative cost of each type of error. Classical statisticians typically 
do not consider the relative cost of the two errors because of the subjective 
nature of this comparison. Bayesian statisticians compare the relative cost of 
the two errors using a loss function. 

As a starting point, we define the characteristics of a test in much the 
same way we defined the goodness of an estimator in Chapter 8. 


Definition 9.4. A test is inadmissible if there exists another test which is
better in the sense of Definition 9.3; otherwise it is called admissible.

FIGURE 9.2
Hypothesis Test for Triangular Distribution.


Definition 9.5. $R$ is the most powerful test of size $\alpha$ if $\alpha(R) = \alpha$ and for
any test $R_1$ of size $\alpha$, $\beta(R) \leq \beta(R_1)$.


Definition 9.6. $R$ is the most powerful test of level $\alpha$ if for any test $R_1$ of
level $\alpha$ (that is, such that $\alpha(R_1) \leq \alpha$), $\beta(R) \leq \beta(R_1)$.


Example 9.7. Let $X$ have the density
$$
f(x) =
\begin{cases}
1 - \theta + x & \text{for } \theta - 1 \leq x \leq \theta \\
1 + \theta - x & \text{for } \theta \leq x \leq \theta + 1.
\end{cases} \tag{9.5}
$$
This funny looking beast is a triangular probability density function, as
depicted in Figure 9.2. Assume that we want to test $H_0: \theta = 0$ against $H_1: \theta = 1$
on the basis of a single observation of $X$.


Type I and Type II errors are then defined by the choice of $t$, the cutoff
point,
$$
\begin{aligned}
\alpha &= \frac{1}{2} (1 - t)^2 \\
\beta &= \frac{1}{2} t^2.
\end{aligned} \tag{9.6}
$$
Specifically, assume that we define a sample statistic such as the mean of the
sample or a single value from the distribution ($x$). Given this statistic, we fail
to reject the null hypothesis ($H_0: \theta = 0$) if the value is less than $t$.
Alternatively, we reject the null hypothesis in favor of the alternative hypothesis
($H_1: \theta = 1$) if the statistic is greater than $t$. In either case, we can derive the
probability of the Type I error as the area of the triangle formed starting at
$t = 1$ to the point $t$. Similarly, we can derive the Type II error starting from
the origin ($t = 0$) to the point $t$.

FIGURE 9.3
Tradeoff of the Power of the Test.

Further, we can derive $\beta$ in terms of $\alpha$, yielding
$$
\beta = \frac{1}{2} \left( 1 - \sqrt{2 \alpha} \right)^2. \tag{9.7}
$$
Specifically, Figure 9.3 depicts the relationship between the Type I and Type
II error for the hypothesis derived in Equation 9.7.
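The tradeoff in Equations 9.6 and 9.7 is easy to tabulate. The following sketch (function names are ours) computes the two error probabilities for several cutoff points and confirms that each pair lies on the curve in Equation 9.7.

```python
import math


def error_probabilities(t):
    """Return (alpha, beta) for a cutoff 0 <= t <= 1, as in Equation 9.6."""
    alpha = 0.5 * (1.0 - t) ** 2  # area under the H0 density to the right of t
    beta = 0.5 * t**2  # area under the H1 density to the left of t
    return alpha, beta


def beta_from_alpha(alpha):
    """Type II error implied by a given Type I error (Equation 9.7)."""
    return 0.5 * (1.0 - math.sqrt(2.0 * alpha)) ** 2


if __name__ == "__main__":
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        alpha, beta = error_probabilities(t)
        print(f"t = {t:4.2f}  alpha = {alpha:6.4f}  beta = {beta:6.4f}  "
              f"check = {beta_from_alpha(alpha):6.4f}")
```

Raising the cutoff lowers $\alpha$ and raises $\beta$, so no choice of $t$ reduces both errors at once.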


Theorem 9.8. The set of admissible characteristics plotted on the $\alpha, \beta$ plane
is a continuous, monotonically decreasing, convex function which starts at a
point within $[0, 1]$ on the $\beta$ axis and ends at a point within $[0, 1]$ on the $\alpha$
axis.


Note that the choice of any $t$ yields an admissible test. However, any
randomized test is inadmissible.
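The inadmissibility of a randomized test in the triangular example follows from the convexity of the curve in Equation 9.7. As a hedged illustration, the sketch below mixes two cutoff points with equal probability (the values 0.2 and 0.8 are arbitrary assumptions) and compares the result with the nonrandomized test that has the same Type I error.

```python
import math


def characteristics(t):
    """(alpha, beta) for the cutoff t in the triangular example (Equation 9.6)."""
    return 0.5 * (1.0 - t) ** 2, 0.5 * t**2


def randomized(delta, t1, t2):
    """Characteristics of choosing cutoff t1 with probability delta, else t2 (Equation 9.4)."""
    a1, b1 = characteristics(t1)
    a2, b2 = characteristics(t2)
    return delta * a1 + (1.0 - delta) * a2, delta * b1 + (1.0 - delta) * b2


def frontier_beta(alpha):
    """Type II error of the nonrandomized test with Type I error alpha (Equation 9.7)."""
    return 0.5 * (1.0 - math.sqrt(2.0 * alpha)) ** 2


if __name__ == "__main__":
    alpha, beta = randomized(0.5, 0.2, 0.8)  # arbitrary illustrative mixture
    print("randomized test:   ", round(alpha, 4), round(beta, 4))
    print("nonrandomized test:", round(alpha, 4), round(frontier_beta(alpha), 4))
```

The nonrandomized test attains the same $\alpha$ with a strictly smaller $\beta$, so the mixture is dominated in the sense of Definition 9.3.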


9.2 Neyman–Pearson Lemma


How does the Bayesian statistician choose between tests? The Bayesian
chooses between $H_0$ and $H_1$ based on the posterior probability of
the hypotheses: $P(H_0 | X)$ and $P(H_1 | X)$. Table 9.1 presents the loss matrix
for hypothesis testing.

The Bayesian decision is then based on this loss function.
$$
\text{Reject } H_0 \text{ if } \gamma_1 P(H_0 | X) < \gamma_2 P(H_1 | X). \tag{9.8}
$$

TABLE 9.1
Loss Matrix in Hypothesis Testing

                 State of Nature
Decision      $H_0$           $H_1$
$H_0$         0               $\gamma_2$
$H_1$         $\gamma_1$      0

The critical region for the test then becomes
$$
R_0 = \{ x \mid \gamma_1 P(H_0 | x) < \gamma_2 P(H_1 | x) \}. \tag{9.9}
$$


Alternatively, the Bayesian problem can be formulated as that of determining
the critical region $R$ in the domain of $X$ so as to
$$
\min \phi(R) = \gamma_1 P(H_0 | X \in R) P(X \in R) + \gamma_2 P(H_1 | X \in \bar{R}) P(X \in \bar{R}). \tag{9.10}
$$
We can write this expression as
$$
\begin{aligned}
\phi(R) &= \gamma_1 P(H_0) P(R | H_0) + \gamma_2 P(H_1) P(\bar{R} | H_1) \\
&= \eta_0 \alpha(R) + \eta_1 \beta(R) \\
\eta_0 &= \gamma_1 P(H_0) \\
\eta_1 &= \gamma_2 P(H_1).
\end{aligned} \tag{9.11}
$$
Choosing between admissible test statistics in the $(\alpha, \beta)$ plane then becomes
like the choice of a utility maximizing consumption point in utility theory.
Specifically, the relative tradeoff between the two characteristics becomes
$-\eta_0 / \eta_1$. The Bayesian optimal test $R_0$ can then be written as
$$
R_0 = \left\{ x \,\middle|\, \frac{L(x | H_1)}{L(x | H_0)} > \frac{\eta_0}{\eta_1} \right\}. \tag{9.12}
$$


Theorem 9.9 (Neyman–Pearson Lemma). If testing $H_0: \theta = \theta_0$ against
$H_1: \theta = \theta_1$, the best critical region is given by
$$
R = \left\{ x \,\middle|\, \frac{L(x | \theta_1)}{L(x | \theta_0)} > c \right\} \tag{9.13}
$$
where $L$ is the likelihood function and $c$ (the critical value) is determined to
satisfy
$$
P(R | \theta_0) = \alpha \tag{9.14}
$$
provided $c$ exists.

FIGURE 9.4
Optimal Choice of Type I and Type II Error.


Theorem 9.10. The Bayes test is admissible.


Thus, the choice of Type I and Type II error is depicted in Figure 9.4.
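The Bayesian choice can be illustrated with the triangular example of Example 9.7. Substituting Equation 9.6 into Equation 9.11 gives an expected loss of $\eta_0 (1 - t)^2 / 2 + \eta_1 t^2 / 2$, which is minimized at $t^* = \eta_0 / (\eta_0 + \eta_1)$; this closed form is our own derivation for illustration. The sketch below checks it with a grid search, using hypothetical losses and priors.

```python
def expected_loss(t, eta0, eta1):
    """eta0 * alpha(t) + eta1 * beta(t) for the triangular example (Equations 9.6, 9.11)."""
    alpha = 0.5 * (1.0 - t) ** 2
    beta = 0.5 * t**2
    return eta0 * alpha + eta1 * beta


def best_cutoff(eta0, eta1, grid=10001):
    """Cutoff on an evenly spaced grid over [0, 1] that minimizes the expected loss."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return min(candidates, key=lambda t: expected_loss(t, eta0, eta1))


if __name__ == "__main__":
    # Hypothetical values: losses gamma1 = 2, gamma2 = 1 and equal prior probabilities.
    eta0, eta1 = 2.0 * 0.5, 1.0 * 0.5
    t_star = best_cutoff(eta0, eta1)
    print("grid minimum:", t_star, " closed form:", eta0 / (eta0 + eta1))
    # At the optimum the slope of the (alpha, beta) curve, -t / (1 - t), equals -eta0 / eta1.
    print("slope:", -t_star / (1.0 - t_star), " tradeoff:", -eta0 / eta1)
```

A larger loss from a Type I error (a larger $\eta_0$) pushes the cutoff toward one, which lowers $\alpha$ at the expense of $\beta$.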


9.3 Simple Tests against a Composite


Mathematically, we now can express the tests as testing between $H_0: \theta = \theta_0$
against $H_1: \theta \in \Theta_1$, where $\Theta_1$ is a subset of the parameter space. Given this
specification, we must modify our definition of the power of the test because
the $\beta$ value (the probability of accepting the null hypothesis when it is false)
is not unique. In this regard, it is useful to develop the power function.


Definition 9.11. If the distribution of the sample $X$ depends on a vector of
parameters $\theta$, we define the power function of the test based on the critical
region $R$ by
$$
Q(\theta) = P(X \in R | \theta). \tag{9.15}
$$

Definition 9.12. Let $Q_1(\theta)$ and $Q_2(\theta)$ be the power functions of two tests,
respectively. Then we say that the first test is uniformly better (or uniformly
most powerful) than the second in testing $H_0: \theta = \theta_0$ against $H_1: \theta \in \Theta_1$ if
$Q_1(\theta_0) = Q_2(\theta_0)$ and $Q_1(\theta) \geq Q_2(\theta)$ for all $\theta \in \Theta_1$ and $Q_1(\theta) > Q_2(\theta)$ for
at least one $\theta \in \Theta_1$.

Definition 9.13. A test $R$ is the uniformly most powerful (UMP) test of size
(level) $\alpha$ for testing $H_0: \theta = \theta_0$ against $H_1: \theta \in \Theta_1$ if $P(R | \theta_0) = (\leq) \, \alpha$ and,
for any other test $R_1$ such that $P(R_1 | \theta_0) = (\leq) \, \alpha$, we have
$P(R | \theta) \geq P(R_1 | \theta)$ for any $\theta \in \Theta_1$.


Definition 9.14. Let $L(x | \theta)$ be the likelihood function and let the null and
alternative hypotheses be $H_0: \theta = \theta_0$ and $H_1: \theta \in \Theta_1$, where $\Theta_1$ is a subset
of the parameter space $\Theta$. Then the likelihood ratio test of $H_0$ against $H_1$ is
defined by the critical region
$$
\Lambda = \frac{L(\theta_0 | x)}{\sup_{\theta_0 \cup \Theta_1} L(\theta | x)} < c \tag{9.16}
$$
where $c$ is chosen to satisfy $P(\Lambda < c | H_0) = \alpha$ for a certain value of $\alpha$.
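A concrete power function helps fix ideas. The sketch below assumes a one-sided test of a normal mean with known variance whose critical region is $\bar{X} > \mu_0 + z_{\alpha} \sigma / \sqrt{n}$; this test, the function name, and the numerical values are illustrative assumptions rather than part of the definitions above.

```python
import math
from statistics import NormalDist


def power(mu, mu0, sigma, n, size=0.05):
    """Q(mu) = P(sample mean falls in the critical region | mu) for a one-sided z test."""
    z = NormalDist()
    critical = z.inv_cdf(1.0 - size)  # z value that gives the requested size
    shift = (mu - mu0) * math.sqrt(n) / sigma
    return 1.0 - z.cdf(critical - shift)


if __name__ == "__main__":
    # Hypothetical setting: mu0 = 5.8, sigma = 0.4, n = 10.
    for mu in (5.8, 5.9, 6.0, 6.1, 6.2):
        print(f"mu = {mu:3.1f}  Q(mu) = {power(mu, 5.8, 0.4, 10):6.4f}")
```

At $\mu = \mu_0$ the power function returns the size of the test, and it rises toward one as $\mu$ moves further into the alternative.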


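Example 9.15. Let the sample be $X_i \sim N(\mu, \sigma^2)$, $i = 1, 2, \ldots, n$ where $\sigma^2$ is
assumed to be known. Let $x_i$ be the observed value of $X_i$. Testing $H_0: \mu = \mu_0$
against $H_1: \mu > \mu_0$, the likelihood ratio test is to reject $H_0$ if
$$
\Lambda = \frac{\exp\left[ -\dfrac{1}{2 \sigma^2} \sum_{i=1}^{n} (x_i - \mu_0)^2 \right]}{\sup_{\mu \geq \mu_0} \exp\left[ -\dfrac{1}{2 \sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right]} < c. \tag{9.17}
$$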


Assume that we had the sample $X = \{6, 7, 8, 9, 10\}$ from the preceding
example and wanted to construct a likelihood ratio test for $\mu > 7.5$.
$$
\begin{aligned}
\Lambda &= \frac{\exp\left[ -\dfrac{1}{2 (1.5)^2} \left( [6 - 7.5]^2 + [7 - 7.5]^2 + \cdots + [10 - 7.5]^2 \right) \right]}{\exp\left[ -\dfrac{1}{2 (1.5)^2} \left( [6 - 8]^2 + [7 - 8]^2 + \cdots + [10 - 8]^2 \right) \right]} \\
&= \frac{\exp(-2.5000)}{\exp(-2.2222)} = 0.7574
\end{aligned} \tag{9.18}
$$
where 8 is the maximum likelihood estimate of $\mu$. Assuming a standard
deviation of 1.5 thus yields a likelihood ratio of 0.7574.
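The arithmetic in Equation 9.18 can be reproduced as follows. The sketch also reports $-2 \ln(\Lambda)$ and its upper-tail probability under a chi-squared distribution with one degree of freedom (the asymptotic result stated in Theorem 9.16), using the identity $P(\chi^2_1 > s) = \operatorname{erfc}(\sqrt{s/2})$. Function names are ours.

```python
import math


def log_kernel(sample, mu, sigma):
    """Exponent of the normal likelihood: -sum (x - mu)^2 / (2 sigma^2)."""
    return -sum((x - mu) ** 2 for x in sample) / (2.0 * sigma**2)


def likelihood_ratio(sample, mu0, sigma):
    """Lambda of Equation 9.17 for H0: mu = mu0 against H1: mu > mu0."""
    mu_hat = max(sum(sample) / len(sample), mu0)  # maximum over mu >= mu0
    return math.exp(log_kernel(sample, mu0, sigma) - log_kernel(sample, mu_hat, sigma))


if __name__ == "__main__":
    sample = [6, 7, 8, 9, 10]
    ratio = likelihood_ratio(sample, mu0=7.5, sigma=1.5)
    statistic = -2.0 * math.log(ratio)
    p_value = math.erfc(math.sqrt(statistic / 2.0))  # chi-squared, 1 degree of freedom
    print(round(ratio, 4), round(statistic, 4), round(p_value, 2))
```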


Theorem 9.16. Let $\Lambda$ be the likelihood ratio test statistic. Then $-2 \ln(\Lambda)$ is
asymptotically distributed as chi-squared with the degrees of freedom equal to
the number of exact restrictions implied by $H_0$.


Thus, the test statistic from Equation 9.18 is 0.5556, which is distributed $\chi^2_1$.
The probability of drawing a test statistic greater than 0.5556 is 0.46, so we
fail to reject the hypothesis at any conventional level of significance.


9.4 Composite against a Composite


Testing a simple hypothesis against a composite is easy because the numerator
has a single value, namely the value of the likelihood function evaluated at the
single point $\theta_0$. Adding a layer of complexity, consider a test that compares
two possible ranges. For example, assume that we are interested in testing
the hypothesis that the demand is inelastic. In this case we would test the
hypothesis that $-1 < \eta < 0$ with the valid range of demand elasticities for
normal goods $\eta < 0$. In this case we compare the likelihood function for the
restricted range ($-1 < \eta < 0$) with the general range $\eta < 0$.


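Definition 9.17. A test $R$ is the uniformly most powerful test of size
(level) $\alpha$ if $\sup_{\theta \in \Theta_0} P(R | \theta) = (\leq) \, \alpha$ and for any other test $R_1$ such that
$\sup_{\theta \in \Theta_0} P(R_1 | \theta) = (\leq) \, \alpha$ we have $P(R | \theta) \geq P(R_1 | \theta)$ for any $\theta \in \Theta_1$.


Definition 9.18. Let $L(x | \theta)$ be the likelihood function. Then the likelihood
ratio test of $H_0$ against $H_1$ is defined by the critical region
$$
\Lambda = \frac{\sup_{\Theta_0} L(\theta | x)}{\sup_{\Theta_0 \cup \Theta_1} L(\theta | x)} < c \tag{9.19}
$$
where $c$ is chosen to satisfy $\sup_{\Theta_0} P(\Lambda < c | \theta) = \alpha$ for a certain specified
value of $\alpha$.


Example 9.19. Let the sample $X_i \sim N(\mu, \sigma^2)$ with unknown $\sigma^2$, $i =
1, 2, \ldots, n$. We want to test $H_0: \mu = \mu_0$ and $0 < \sigma^2 < \infty$ against $H_1: \mu > \mu_0$
and $0 < \sigma^2 < \infty$.
$$
L[\theta] = (2 \pi)^{-n/2} (\sigma^2)^{-n/2} \exp\left[ -\frac{1}{2 \sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right]. \tag{9.20}
$$
Using the concentrated likelihood function at the null hypothesis,
$$
\begin{aligned}
\sup_{\Theta_0} L(\theta) &= (2 \pi)^{-n/2} (\bar{\sigma}^2)^{-n/2} \exp\left[ -\frac{n}{2} \right] \\
\bar{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_0)^2.
\end{aligned} \tag{9.21}
$$
The likelihood value can be compared with the maximum likelihood value of
$$
\begin{aligned}
\sup_{\Theta_0 \cup \Theta_1} L(\theta) &= (2 \pi)^{-n/2} (\hat{\sigma}^2)^{-n/2} \exp\left[ -\frac{n}{2} \right] \\
\hat{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2
\end{aligned} \tag{9.22}
$$
where $\hat{\mu}$ is the maximum likelihood estimate of $\mu$. The critical region then
becomes
$$
\left( \frac{\bar{\sigma}^2}{\hat{\sigma}^2} \right)^{-n/2} < c. \tag{9.23}
$$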


Turning back to the Bayesian model, the Bayesian would solve the problem
testing $H_0: \theta \leq \theta_0$ against $H_1: \theta > \theta_0$. Let $L_2(\theta)$ be the loss incurred by
choosing $H_0$ and $L_1(\theta)$ be the loss incurred by choosing $H_1$. The Bayesian
rejects $H_0$ if
$$
\int_{-\infty}^{\infty} L_1(\theta) f(\theta | x) \, d\theta < \int_{-\infty}^{\infty} L_2(\theta) f(\theta | x) \, d\theta \tag{9.24}
$$
where $f(\theta | x)$ is the posterior distribution of $\theta$.


Example 9.20 (Mean of a Binomial Distribution). Assume that we want
to know whether a coin toss is biased based on a sample of ten tosses. Our
null hypothesis is that the coin is fair ($H_0: p = 1/2$) versus an alternative
hypothesis that the coin toss is biased toward heads ($H_1: p > 1/2$). Assume
that you tossed the coin ten times and observed eight heads. What is the
probability of drawing at least eight heads from ten tosses of a fair coin?
$$
P[n \geq 8] = \binom{10}{10} p^{10} (1 - p)^0 + \binom{10}{9} p^9 (1 - p)^1 + \binom{10}{8} p^8 (1 - p)^2. \tag{9.25}
$$
If $p = 1/2$, $P[n \geq 8] = 0.054688$. Thus, we reject $H_0$ at a significance level of
0.10 and fail to reject $H_0$ at a 0.05 significance level.

Moving to the likelihood ratio test,
$$
\Lambda = \frac{0.5^8 (1 - 0.5)^2}{0.8^8 (1 - 0.8)^2} = 0.1455 \tag{9.26}
$$
where $p = 0.5$ under the null hypothesis and $\hat{p} = 0.8$ is the maximum
likelihood estimate. Given that
$$
-2 \ln(\Lambda) \sim \chi^2_1 \tag{9.27}
$$
we reject the hypothesis of a fair coin toss at a 0.05 significance level:
$-2 \ln(\Lambda) = 3.855$ and the critical value for a chi-squared distribution with
one degree of freedom is 3.84.
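Both calculations in Example 9.20 take only a few lines. The sketch below (function names are ours, and `math.comb` assumes Python 3.8 or later) computes the exact binomial tail probability and the likelihood ratio statistic.

```python
import math


def upper_tail(k, n, p):
    """P[number of heads >= k] in n independent tosses with success probability p."""
    return sum(math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k, n + 1))


def likelihood_ratio(heads, n, p0):
    """Likelihood at p0 divided by the likelihood at the estimate under p >= p0."""
    p_hat = max(heads / n, p0)
    numerator = p0**heads * (1.0 - p0) ** (n - heads)
    denominator = p_hat**heads * (1.0 - p_hat) ** (n - heads)
    return numerator / denominator


if __name__ == "__main__":
    print("P[n >= 8] =", upper_tail(8, 10, 0.5))
    ratio = likelihood_ratio(8, 10, 0.5)
    print("Lambda =", round(ratio, 4), " -2 ln(Lambda) =", round(-2.0 * math.log(ratio), 3))
```

The exact test and the asymptotic likelihood ratio test give slightly different answers at the 0.05 level here, which is a reminder that the chi-squared result is a large-sample approximation applied to only ten tosses.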


Example 9.21. Suppose the heights of male Stanford students are distributed
$N(\mu, \sigma^2)$ with a known variance of 0.16. Assume that we want to test whether
the mean of this distribution is 5.8 against the hypothesis that the mean of the
distribution is 6. What is the test statistic for a 5 percent level of significance
and a 10 percent level of significance? Under the null hypothesis,

$$\bar{X} \sim N\left(5.8, \frac{0.16}{10}\right). \tag{9.28}$$

The test statistic then becomes

$$Z = \frac{6 - 5.8}{0.1265} = 1.58 \sim N(0,1). \tag{9.29}$$

Given that $P[Z \ge 1.58] = 0.0571$, we have the same decisions as above,
namely, that we reject the hypothesis at a significance level of 0.10 and fail
to reject the hypothesis at a significance level of 0.05.
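As a hedged check on Example 9.21, the standard normal tail probability can be computed from the complementary error function. The sketch below assumes the sample size of 10 implied by the standard error of 0.1265 and treats 6 as the observed sample mean, as the computation in the text does; the helper name `upper_tail_normal` is illustrative.

```python
from math import erfc, sqrt

def upper_tail_normal(z):
    """P[Z > z] for a standard normal Z, via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2.0))

sigma2, n = 0.16, 10      # known variance; sample size implied by 0.16/10
mu0, xbar = 5.8, 6.0      # null mean and the sample mean used in the text
z = (xbar - mu0) / sqrt(sigma2 / n)
p_value = upper_tail_normal(z)
print(z, p_value)
```

The unrounded statistic is slightly above 1.58, so the tail probability is marginally below the 0.0571 reported for the rounded value; the decision at both significance levels is unchanged.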


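Example 9.22 (Mean of Normal with Variance Unknown). Assume the same
scenario as above, but that the variance is unknown. Given the estimated
variance is 0.16, the test becomes

$$\frac{\bar{X} - \mu}{\hat{\sigma}/\sqrt{10}} \sim t_9. \tag{9.30}$$

With $\mu = 5.8$ under the null hypothesis the statistic is again 1.58, and the
computed probability becomes $P[t_9 > 1.58] = 0.074$.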


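Example 9.23 (Differences in Variances). In Section 8.1, we discussed the
chi-squared distribution as a distribution of the sample variance. Following
Theorem 8.5, let $X_1, X_2, \cdots X_n$ be a random sample from a $N(\mu, \sigma^2)$ distribution,
and let

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad S^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2. \tag{9.31}$$

Then $\bar{X}$ and $S^2$ are independent random variables, $\bar{X} \sim N\left(\mu, \sigma^2/n\right)$, and
$nS^2/\sigma^2$ has a chi-squared distribution with $n - 1$ degrees of freedom.

Given the distribution of the sample variance, we may want to compare
two sample variances,

$$\frac{n_X S_X^2}{\sigma_X^2} \sim \chi^2_{n_X - 1} \quad \text{and} \quad \frac{n_Y S_Y^2}{\sigma_Y^2} \sim \chi^2_{n_Y - 1}. \tag{9.32}$$

Dividing the first by the second and correcting for degrees of freedom yields,
under the hypothesis that $\sigma_X^2 = \sigma_Y^2$,

$$\frac{n_X S_X^2 / (n_X - 1)}{n_Y S_Y^2 / (n_Y - 1)} \sim F(n_X - 1, n_Y - 1). \tag{9.33}$$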


This statistic is used to determine the statistical significance of regressions. 
Specifically, let se be the standard error of a restricted model and Se be the 
standard error of an unrestricted model. The ratio of the standard errors be- 
comes a test of the restrictions. However, the test actually tests for differences 
in estimated variances. 
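9.5 Testing Hypotheses about Vectors


Extending the test results beyond the test of a single parameter, we now want
to test $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$ where $\theta$ is a $k \times 1$ vector of parameters.
We begin by assuming that

$$\hat{\theta} \sim N(\theta, \Sigma) \tag{9.34}$$

where $\Sigma$ is a known variance matrix.
First, assuming that $k = 2$, we have

$$\begin{pmatrix} \hat{\theta}_1 \\ \hat{\theta}_2 \end{pmatrix} \sim N\left[\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}\right]. \tag{9.35}$$

A simple test of the null hypothesis, assuming that the parameters are
uncorrelated, would then be

$$R: \frac{\left(\hat{\theta}_1 - \theta_{01}\right)^2}{\sigma_{11}} + \frac{\left(\hat{\theta}_2 - \theta_{02}\right)^2}{\sigma_{22}} > c \tag{9.36}$$

where $\theta_{01}$ and $\theta_{02}$ are the elements of $\theta_0$.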


Building on this concept, assume that we can design a matrix $A$ such that
$A\Sigma A' = I$. This theorem relies on the eigenvalues of the matrix (see Chapter
10). Specifically, the eigenvalues of the matrix ($\lambda$) are defined by the solution
of the equation $\det(\Sigma - I\lambda) = 0$. These values are real if the $\Sigma$ matrix is
symmetric, and positive if the $\Sigma$ matrix is positive definite. In addition, if the
matrix is positive definite, there are $k$ positive eigenvalues (not necessarily
distinct). Associated with each eigenvalue is an eigenvector $u$, defined by
$u(\Sigma - I\lambda) = 0$. Carrying the eigenvector multiplication through implies
$A\Sigma - \Lambda A = 0$ where $A$ is a matrix of eigenvectors and $\Lambda$ is a diagonal matrix
of eigenvalues. By construction, the eigenvectors are orthogonal so that
$AA' = I$. Thus, $A\Sigma A' = \Lambda$. Because $\Lambda$ is diagonal with positive elements,
rescaling each row of $A$ by the inverse square root of its eigenvalue yields a
matrix that satisfies $A\Sigma A' = I$.

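This transformation implies

$$R: \left(A\hat{\theta} - A\theta_0\right)'\left(A\hat{\theta} - A\theta_0\right) > c$$

$$\Rightarrow \left[A\left(\hat{\theta} - \theta_0\right)\right]'\left[A\left(\hat{\theta} - \theta_0\right)\right] = \left(\hat{\theta} - \theta_0\right)' A'A \left(\hat{\theta} - \theta_0\right) > c. \tag{9.37}$$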


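Note that the likelihood ratio test for this scenario becomes

$$\lambda = \frac{\exp\left[-\frac{1}{2}\left(\hat{\theta} - \theta_0\right)'\Sigma^{-1}\left(\hat{\theta} - \theta_0\right)\right]}{\max_{\theta_{MLE}} \exp\left[-\frac{1}{2}\left(\hat{\theta} - \theta_{MLE}\right)'\Sigma^{-1}\left(\hat{\theta} - \theta_{MLE}\right)\right]}. \tag{9.38}$$

Given that $\theta_{MLE}$ goes to $\hat{\theta}$, the denominator of the likelihood ratio test becomes
one and

$$\left(\hat{\theta} - \theta_0\right)'\Sigma^{-1}\left(\hat{\theta} - \theta_0\right) \sim \chi^2_k. \tag{9.39}$$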
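A primary problem in the construction of these statistics is the assumption
that we know the variance matrix. If we assume that we know the variance
matrix up to a scalar ($\Sigma = \sigma^2 Q$ where $Q$ is known and $\sigma^2$ is unknown), the test
becomes

$$\frac{\left(\hat{\theta} - \theta_0\right)' Q^{-1}\left(\hat{\theta} - \theta_0\right)}{\sigma^2} > c. \tag{9.40}$$

Using the traditional chi-squared result,

$$\frac{W}{\sigma^2} \sim \chi^2_M \tag{9.41}$$

dividing Equation 9.40 by Equation 9.41, and correcting each for its degrees of
freedom, yields

$$\frac{\left(\hat{\theta} - \theta_0\right)' Q^{-1}\left(\hat{\theta} - \theta_0\right)/K}{W/M} \sim F(K, M). \tag{9.42}$$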
M 


9.6 Delta Method 


The hypotheses presented above are all linear: $H_0: \beta = 2$ or, in vector space,
$H_0: 2\beta_1 + \beta_2 = 0$. For the test in vector space we have

$$\begin{bmatrix} 2 & 1 \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = 0. \tag{9.43}$$


A little more complex scenario involves the testing of nonlinear constraints or 
hypotheses. 

Nonlinearity in estimation may arise from a variety of sources. One ex- 
ample involves the complexity of estimation, as discussed in Chapter 12. One 
frequent problem involves the estimation of the standard deviation instead 
of the variance in maximum likelihood. Specifically, using the normal den- 
sity function to estimate the Cobb-Douglas production function with normal 
errors yields 


$$L \propto -\frac{N}{2}\ln\left[\sigma^2\right] - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - \alpha_0 x_{1i}^{\alpha_1} x_{2i}^{1-\alpha_1}\right)^2. \tag{9.44}$$

This problem is usually solved with iterative techniques as presented in
Appendix E. These techniques attempt to improve on an initial guess by
computing a step (i.e., changes in the parameters $\sigma^2$, $\alpha_0$, and $\alpha_1$) based on the
derivatives of the likelihood function. Sometimes the step takes the parameters
into an infeasible region. For example, the step may cause $\sigma^2$ to become
negative. As a result, we often estimate the standard deviation rather than
the variance (i.e., $\sigma$ instead of $\sigma^2$). No matter the estimate of $\sigma$ (i.e., negative
or positive), the likelihood function presented in Equation 9.44 is always valid
(i.e., $\hat{\sigma}^2 > 0$). The problem is that we are usually interested in the distribution
of the variance and not the standard deviation.

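The delta method is based on the first order Taylor series approximation

$$g\left(\hat{\beta}\right) = g(\beta) + \frac{\partial g(\beta)}{\partial \beta}\left(\hat{\beta} - \beta\right). \tag{9.45}$$

Resolving Equation 9.45, we conjecture that

$$\lim_{N \to \infty}\left[g\left(\hat{\beta}\right) - g(\beta)\right] = \lim_{N \to \infty}\frac{\partial g(\beta)}{\partial \beta}\left(\hat{\beta} - \beta\right) = 0. \tag{9.46}$$

Using the limit in Equation 9.46, we conclude that

$$V\left(g\left(\hat{\beta}\right)\right) = \left(\frac{\partial g(\beta)}{\partial \beta}\right)' \Sigma_\beta \left(\frac{\partial g(\beta)}{\partial \beta}\right) \tag{9.47}$$

where $\Sigma_\beta$ is the variance (or variance matrix) for the $\beta$ parameter(s).

In our simple case, assume that we estimate $s$ (the standard deviation). The
variance of the variance is then computed given that $\sigma^2 = s^2$. Because the
derivative of $s^2$ with respect to $s$ is $2s$, the variance of $\sigma^2$ is then $2^2 s^2 \Sigma_s$.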
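The scalar case can be illustrated numerically. The sketch below applies Equation 9.47 to $g(s) = s^2$ using a finite-difference derivative and compares the result with the analytic expression $4s^2\Sigma_s$. The estimate `s_hat` and its variance `var_s` are hypothetical values chosen only for illustration, and `delta_method_variance` is an illustrative helper name.

```python
def delta_method_variance(g, theta_hat, var_theta, h=1e-6):
    """First-order (delta method) variance of g(theta_hat) for a scalar parameter."""
    # central finite-difference approximation to dg/dtheta at theta_hat
    grad = (g(theta_hat + h) - g(theta_hat - h)) / (2 * h)
    return grad * var_theta * grad

s_hat = 0.4     # hypothetical estimate of the standard deviation
var_s = 0.01    # hypothetical variance of that estimate

approx = delta_method_variance(lambda s: s**2, s_hat, var_s)
analytic = 4 * s_hat**2 * var_s   # (2s)^2 times the variance of s
print(approx, analytic)
```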




9.7 Chapter Summary 


• The basic concept in this chapter is to define regions for statistics such
that we can fail to reject or reject hypotheses. This is the stuff of previous
statistics classes: do we reject the hypothesis that the mean is zero based
on a sample?

• Constructing these regions involves balancing two potential errors:

  - Type I error: the possibility of rejecting the null hypothesis when it
    is correct.

  - Type II error: the possibility of failing to reject the null hypothesis
    when it is incorrect.

• From an economic perspective there is no free lunch. For a given sample,
decreasing Type I error implies increasing Type II error.

• A simple hypothesis test involves testing against a single valued alternative
hypothesis, $H_0: \mu = 2$.

• A complex hypothesis involves testing against a range of alternatives,
$H_0: \mu \in [0, 1]$.


9.8 Review Questions


9-1R. Given the sample $s = \{6.0, 7.0, 7.5, 8.0, 8.5, 10.0\}$, derive the likelihood
test for $H_0: \mu = 7.0$ versus $H_1: \mu = 8.0$ assuming normality with a
variance of $\sigma^2 = 2$.

9-2R. Using the sample from review question 9-1R, derive the likelihood test
for $H_0: \mu = 7.0$ versus $H_1: \mu \ne 7.0$ assuming normality with a known
variance of $\sigma^2 = 2$.


9.9 Numerical Exercises


9-1E. Consider the distribution functions

$$f(x) = \frac{3}{4}\left(1 - (x+1)^2\right), \quad x \in [-2, 0]$$
$$g(x) = \frac{1}{4}\left(1 - \frac{(x-2)^2}{9}\right), \quad x \in [-1, 5] \tag{9.48}$$

where the mean of $f(x)$ is $-1$ and $g(x)$ is 2. What value of $T$ defined
as a draw from the sample gives a Type I error of 0.10? What is the
associated Type II error?

9-2E. Using the data in Table 8.4, test $H_0: \theta = 0.50$ versus $H_1: \theta = 0.75$.

9-3E. Using the data in Table 8.5, test whether the mean of sample 1
equals the mean of sample 3.

9-4E. Using the sample from review question 9-1R, compute the same test
with an unknown variance.
Econometric Applications

Many of the traditional econometric applications involve the estimation of
linear equations or systems of equations to describe the behavior of individuals
or groups of individuals. For example, we can specify that the quantity of an
input demanded by a firm is a linear function of the firm's output price, the
price of the input, and the price of other inputs

$$x_t^D = \alpha_0 + \alpha_1 p_t + \alpha_2 w_{1t} + \alpha_3 w_{2t} + \alpha_4 w_{3t} + \epsilon_t \tag{10.1}$$

where $x_t^D$ is the quantity of the input demanded at time $t$, $p_t$ is the price of
the firm's output, $w_{1t}$ is the price of the input, $w_{2t}$ and $w_{3t}$ are the prices
of other inputs used by the firm, and $\epsilon_t$ is a random error. Under a variety
of assumptions such as those discussed in Chapter 6 or by assuming that
$\epsilon_t \sim N\left(0, \sigma^2\right)$, Equation 10.1 can be estimated using matrix methods. For
example, assume that we have a simple linear specification

$$y_i = \alpha_0 + \alpha_1 x_i + \epsilon_i. \tag{10.2}$$

Section 7.6 depicts the derivation of a set of two normal equations.

$$-\frac{1}{N}\sum_{i=1}^{N} y_i + \alpha_0 + \alpha_1 \frac{1}{N}\sum_{i=1}^{N} x_i = 0$$
$$-\frac{1}{N}\sum_{i=1}^{N} x_i y_i + \alpha_0 \frac{1}{N}\sum_{i=1}^{N} x_i + \alpha_1 \frac{1}{N}\sum_{i=1}^{N} x_i^2 = 0. \tag{10.3}$$
Remembering our discussion regarding sufficient statistics in Section 7.4, we
defined $T_x = 1/N \sum_{i=1}^{N} x_i$, $T_y = 1/N \sum_{i=1}^{N} y_i$, $T_{xy} = 1/N \sum_{i=1}^{N} x_i y_i$, and
$T_{xx} = 1/N \sum_{i=1}^{N} x_i^2$. Given these sufficient statistics, we can rewrite the
normal equations in Equation 10.3 as a linear system of equations.

$$T_y = \alpha_0 + \alpha_1 T_x$$
$$T_{xy} = \alpha_0 T_x + \alpha_1 T_{xx}. \tag{10.4}$$

The system of normal equations in Equation 10.4 can be further simplified
into a matrix form

$$\begin{bmatrix} T_y \\ T_{xy} \end{bmatrix} = \begin{bmatrix} 1 & T_x \\ T_x & T_{xx} \end{bmatrix}\begin{bmatrix} \alpha_0 \\ \alpha_1 \end{bmatrix}. \tag{10.5}$$


Further, we can solve for the set of parameters in Equation 10.5 using some 
fairly standard (and linear) operations. 
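To make the "fairly standard" operations concrete, the following sketch solves the two-equation system in Equation 10.5 directly from the sufficient statistics by inverting the 2 x 2 matrix by hand. The data are hypothetical (generated from $y = 1 + 2x$ with no error) and the function name `ols_from_moments` is an illustrative choice.

```python
def ols_from_moments(x, y):
    """Solve [T_y, T_xy]' = [[1, T_x], [T_x, T_xx]] [a0, a1]' for a0 and a1."""
    n = len(x)
    t_x = sum(x) / n
    t_y = sum(y) / n
    t_xy = sum(xi * yi for xi, yi in zip(x, y)) / n
    t_xx = sum(xi * xi for xi in x) / n
    det = t_xx - t_x**2                     # determinant of the 2 x 2 matrix
    a0 = (t_xx * t_y - t_x * t_xy) / det    # first row of the inverse times the moments
    a1 = (t_xy - t_x * t_y) / det           # second row of the inverse times the moments
    return a0, a1

x = [1.0, 2.0, 3.0, 4.0]      # hypothetical regressor
y = [3.0, 5.0, 7.0, 9.0]      # hypothetical response, exactly 1 + 2x
a0, a1 = ols_from_moments(x, y)
print(a0, a1)
```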

This linkage between linear models and linear estimators has led to a
historical reliance of econometrics on a set of linear estimators including ordinary
least squares and generalized method of moments. It has also rendered the
Gauss-Markov proof of ordinary least squares (the proof that ordinary
least squares is the Best Linear Unbiased Estimator (BLUE)) an essential
element of the econometrician's toolbox. This chapter reviews the basic
set of matrix operations; Chapter 11 provides two related proofs of the BLUE
property of ordinary least squares.


10.1 Review of Elementary Matrix Algebra 


It is somewhat arbitrary and completely unnecessary for our purposes to draw
a sharp demarcation between linear and matrix algebra. To introduce matrix
algebra, consider the general class of linear problems similar to Equation 10.4:

$$y_1 = \alpha_{10} + \alpha_{11}x_1 + \alpha_{12}x_2 + \alpha_{13}x_3$$
$$y_2 = \alpha_{20} + \alpha_{21}x_1 + \alpha_{22}x_2 + \alpha_{23}x_3$$
$$y_3 = \alpha_{30} + \alpha_{31}x_1 + \alpha_{32}x_2 + \alpha_{33}x_3$$
$$y_4 = \alpha_{40} + \alpha_{41}x_1 + \alpha_{42}x_2 + \alpha_{43}x_3. \tag{10.6}$$


In this section, we develop the basic mechanics of writing systems of equations 
such as those depicted in Equation 10.6 as matrix expressions and develop 
solutions to these expressions. 


10.1.1 Basic Definitions 


As a first step, we define the concepts that allow us to write Equation 10.6 as 
a matrix equation. 
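Matrices and Vectors


A matrix $A$ of size $m \times n$ is an $m \times n$ rectangular array of scalars:

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}. \tag{10.7}$$

For the system in Equation 10.6, the matrix of coefficients is

$$A = \begin{bmatrix} \alpha_{10} & \alpha_{11} & \alpha_{12} & \alpha_{13} \\ \alpha_{20} & \alpha_{21} & \alpha_{22} & \alpha_{23} \\ \alpha_{30} & \alpha_{31} & \alpha_{32} & \alpha_{33} \\ \alpha_{40} & \alpha_{41} & \alpha_{42} & \alpha_{43} \end{bmatrix}. \tag{10.8}$$

It is sometimes useful to partition matrices into vectors, either by columns

$$A = \begin{bmatrix} a_{\cdot 1} & a_{\cdot 2} & \cdots & a_{\cdot n} \end{bmatrix}, \quad a_{\cdot 1} = \begin{bmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{m1} \end{bmatrix}, \cdots, a_{\cdot n} = \begin{bmatrix} a_{1n} \\ a_{2n} \\ \vdots \\ a_{mn} \end{bmatrix} \tag{10.9}$$

or by rows

$$A = \begin{bmatrix} a_{1\cdot} \\ a_{2\cdot} \\ \vdots \\ a_{m\cdot} \end{bmatrix}, \quad \begin{aligned} a_{1\cdot} &= \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \end{bmatrix} \\ a_{2\cdot} &= \begin{bmatrix} a_{21} & a_{22} & \cdots & a_{2n} \end{bmatrix} \\ &\ \ \vdots \\ a_{m\cdot} &= \begin{bmatrix} a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}. \end{aligned} \tag{10.10}$$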

Operations of Matrices

The sum of two identically dimensioned matrices can be expressed as

$$A + B = \left[a_{ij} + b_{ij}\right]. \tag{10.11}$$

In order to multiply a matrix by a scalar, multiply each element of the matrix
by the scalar. In order to discuss matrix multiplication, we first discuss vector
multiplication. Two vectors $x$ and $y$ can be multiplied together to form $z$
($z = x \cdot y$) only if they are conformable. If $x$ is of order $1 \times n$ and $y$ is of order
$n \times 1$, then the vectors are conformable and the multiplication becomes

$$z = xy = \sum_{i=1}^{n} x_i y_i. \tag{10.12}$$
i=1 


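Extending this discussion to matrices, two matrices $A$ and $B$ can be multiplied
if they are conformable. If $A$ is of order $k \times n$ and $B$ is of order $n \times l$ then
the matrices are conformable. Using the partitioned matrix above, we have

$$C = AB = \begin{bmatrix} a_{1\cdot} \\ a_{2\cdot} \\ \vdots \\ a_{k\cdot} \end{bmatrix}\begin{bmatrix} b_{\cdot 1} & b_{\cdot 2} & \cdots & b_{\cdot l} \end{bmatrix} = \begin{bmatrix} a_{1\cdot}b_{\cdot 1} & a_{1\cdot}b_{\cdot 2} & \cdots & a_{1\cdot}b_{\cdot l} \\ a_{2\cdot}b_{\cdot 1} & a_{2\cdot}b_{\cdot 2} & \cdots & a_{2\cdot}b_{\cdot l} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k\cdot}b_{\cdot 1} & a_{k\cdot}b_{\cdot 2} & \cdots & a_{k\cdot}b_{\cdot l} \end{bmatrix}. \tag{10.13}$$

These mechanics allow us to rewrite the equations presented in Equation 10.6
in true matrix form.

$$y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} \alpha_{10} & \alpha_{11} & \alpha_{12} & \alpha_{13} \\ \alpha_{20} & \alpha_{21} & \alpha_{22} & \alpha_{23} \\ \alpha_{30} & \alpha_{31} & \alpha_{32} & \alpha_{33} \\ \alpha_{40} & \alpha_{41} & \alpha_{42} & \alpha_{43} \end{bmatrix}\begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = Ax. \tag{10.14}$$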


Theorem 10.1 presents some general matrix results that are useful. Basi- 
cally, we can treat some matrix operations much the same way we treat scalar 
(i.e., single number) operations. The difference is that we always have to be 
careful that the matrices are conformable. 


Theorem 10.1. Let $\alpha$ and $\beta$ be scalars and $A$, $B$, and $C$ be matrices. Then
when the operations involved are defined, the following properties hold:


a) $A + B = B + A$
b) $(A + B) + C = A + (B + C)$
c) $\alpha(A + B) = \alpha A + \alpha B$
d) $(\alpha + \beta)A = \alpha A + \beta A$
e) $A - A = A + (-A) = [0]$
f) $A(B + C) = AB + AC$
g) $(A + B)C = AC + BC$
h) $(AB)C = A(BC)$


The transpose of an $m \times n$ matrix is an $n \times m$ matrix with the rows and columns
interchanged. The transpose of $A$ is denoted $A'$.


Theorem 10.2. Let $\alpha$ and $\beta$ be scalars and $A$ and $B$ be matrices. Then when
defined, the following hold:

a) $(\alpha A)' = \alpha A'$
b) $(A')' = A$
c) $(\alpha A + \beta B)' = \alpha A' + \beta B'$
d) $(AB)' = B'A'$


Traces of Matrices 


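The trace is a function defined as the sum of the diagonal elements of a square
matrix.

$$\operatorname{tr}(A) = \sum_{i=1}^{m} a_{ii}. \tag{10.15}$$


Theorem 10.3. Let $\alpha$ be a scalar and $A$ and $B$ be matrices. Then when the
appropriate operations are defined, we have:


a) $\operatorname{tr}(A') = \operatorname{tr}(A)$
b) $\operatorname{tr}(\alpha A) = \alpha \operatorname{tr}(A)$
c) $\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B)$
d) $\operatorname{tr}(AB) = \operatorname{tr}(B'A')$
e) $\operatorname{tr}(A'A) = 0$ if and only if $A = [0]$


Traces can be very useful in statistical applications. For example, the natural
logarithm of the multivariate normal likelihood for $n$ observations on an
$m \times 1$ vector $y_i$ can be written as

$$\ln L(\mu, \Omega) = -\frac{1}{2}mn\ln(2\pi) - \frac{1}{2}n\ln(|\Omega|) - \frac{1}{2}\operatorname{tr}\left(\Omega^{-1}Z\right)$$
$$Z = \sum_{i=1}^{n}(y_i - \mu)(y_i - \mu)'. \tag{10.16}$$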
Determinants of Matrices 


The determinant is another function of square matrices. In its most technical
form, the determinant is defined as

$$|A| = \sum (-1)^{f(i_1, i_2, \cdots i_m)} a_{1 i_1} a_{2 i_2} \cdots a_{m i_m} = \sum (-1)^{f(i_1, i_2, \cdots i_m)} a_{i_1 1} a_{i_2 2} \cdots a_{i_m m} \tag{10.17}$$

where the summation is taken over all permutations $(i_1, i_2, \cdots i_m)$ of the set
of integers $(1, 2, \cdots m)$, and the function $f(i_1, i_2, \cdots i_m)$ equals the number of
transpositions necessary to change $(i_1, i_2, \cdots i_m)$ to $(1, 2, \cdots m)$.
In the simple case of a $2 \times 2$, we have two possibilities, $(1, 2)$ and $(2, 1)$.
The second requires one transposition. Under the basic definition of the
determinant,

$$|A| = (-1)^0 a_{11}a_{22} + (-1)^1 a_{12}a_{21}. \tag{10.18}$$

In the slightly more complicated case of a $3 \times 3$, we have six possibilities,
$(1, 2, 3)$, $(2, 1, 3)$, $(2, 3, 1)$, $(3, 2, 1)$, $(3, 1, 2)$, $(1, 3, 2)$. Each one of these differs
from the previous one by one transposition. Thus, the number of transpositions
is 0, 1, 2, 3, 4, 5. The determinant is then defined as

$$\begin{aligned} |A| &= (-1)^0 a_{11}a_{22}a_{33} + (-1)^1 a_{12}a_{21}a_{33} + (-1)^2 a_{12}a_{23}a_{31} \\ &\quad + (-1)^3 a_{13}a_{22}a_{31} + (-1)^4 a_{13}a_{21}a_{32} + (-1)^5 a_{11}a_{23}a_{32} \\ &= a_{11}a_{22}a_{33} - a_{12}a_{21}a_{33} + a_{12}a_{23}a_{31} - a_{13}a_{22}a_{31} \\ &\quad + a_{13}a_{21}a_{32} - a_{11}a_{23}a_{32}. \end{aligned} \tag{10.19}$$


A more straightforward definition involves the expansion down a column
or across a row. In order to do this, I want to introduce the concept of
minors. The minor of an element in a matrix is the matrix
with the row and column of the element removed. The determinant of the
minor times negative one raised to the row number plus the column
number is called the cofactor of the element. The determinant is then the sum
of the cofactors times the elements down a particular column or across a
row.

$$|A| = \sum_{j=1}^{m} a_{ij}A_{ij} = \sum_{j=1}^{m} a_{ij}(-1)^{i+j}\left|m_{ij}\right|. \tag{10.20}$$

In the $3 \times 3$ case, expanding down the first column,

$$|A| = a_{11}(-1)^{1+1}\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} + a_{21}(-1)^{2+1}\begin{vmatrix} a_{12} & a_{13} \\ a_{32} & a_{33} \end{vmatrix} + a_{31}(-1)^{3+1}\begin{vmatrix} a_{12} & a_{13} \\ a_{22} & a_{23} \end{vmatrix}. \tag{10.21}$$

Expanding this expression yields

$$\begin{aligned} |A| &= a_{11}a_{22}a_{33} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33} + a_{13}a_{21}a_{32} \\ &\quad + a_{12}a_{23}a_{31} - a_{13}a_{22}a_{31}. \end{aligned} \tag{10.22}$$
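The two definitions of the determinant can be compared directly in code. The sketch below implements the permutation definition (Equation 10.17) and the cofactor expansion (Equation 10.20, across the first row). It uses the number of inversions of each permutation to determine the sign, which has the same parity as the number of transpositions. The function names and the 3 x 3 test matrix are illustrative choices.

```python
from itertools import permutations

def det_by_permutations(a):
    """Determinant as a signed sum over all permutations of the column indices."""
    m = len(a)
    total = 0
    for perm in permutations(range(m)):
        # the parity of the inversion count equals the parity of the
        # number of transpositions needed to sort the permutation
        inversions = sum(1 for i in range(m) for j in range(i + 1, m)
                         if perm[i] > perm[j])
        term = (-1) ** inversions
        for row, col in enumerate(perm):
            term *= a[row][col]
        total += term
    return total

def det_by_cofactors(a):
    """Determinant by cofactor expansion across the first row."""
    m = len(a)
    if m == 1:
        return a[0][0]
    total = 0
    for j in range(m):
        # minor: delete the first row and column j
        minor = [row[:j] + row[j + 1:] for row in a[1:]]
        total += a[0][j] * (-1) ** j * det_by_cofactors(minor)
    return total

A = [[1, 9, 5], [3, 7, 8], [2, 3, 5]]   # illustrative matrix
print(det_by_permutations(A), det_by_cofactors(A))
```

Both functions are exponential in the dimension and are meant only to mirror the definitions; numerical work would rely on elimination instead.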


Theorem 10.4. If $\alpha$ is a scalar and $A$ is an $m \times m$ matrix, then the following
properties hold:


a) $|A'| = |A|$
b) $|\alpha A| = \alpha^m |A|$
c) If $A$ is a diagonal matrix, $|A| = a_{11}a_{22} \cdots a_{mm}$
d) If all the elements of a row (or column) of $A$ are zero then $|A| = 0$
e) If two rows (or columns) of $A$ are proportional then $|A| = 0$
f) The interchange of two rows (or columns) of $A$ changes the sign of $|A|$
g) If all the elements of a row (or a column) of $A$ are multiplied by $\alpha$ then
the determinant is multiplied by $\alpha$
h) The determinant of $A$ is unchanged when a multiple of one row (or
column) is added to another row (or column)

The Inverse 


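Any $m \times m$ matrix $A$ such that $|A| \ne 0$ is said to be a nonsingular matrix and
possesses an inverse denoted $A^{-1}$.

$$AA^{-1} = A^{-1}A = I. \tag{10.23}$$


Theorem 10.5. If $\alpha$ is a nonzero scalar, and $A$ and $B$ are nonsingular $m \times m$
matrices, then:


a) $(\alpha A)^{-1} = \alpha^{-1}A^{-1}$
b) $(A')^{-1} = \left(A^{-1}\right)'$
c) $\left(A^{-1}\right)^{-1} = A$
d) $\left|A^{-1}\right| = |A|^{-1}$
e) If $A = \operatorname{diag}(a_{11}, \cdots a_{mm})$ then $A^{-1} = \operatorname{diag}\left(a_{11}^{-1}, \cdots a_{mm}^{-1}\right)$
f) If $A = A'$ then $A^{-1} = \left(A^{-1}\right)'$
g) $(AB)^{-1} = B^{-1}A^{-1}$

The most general definition of an inverse involves the adjoint matrix
(denoted $A_\#$). The adjoint matrix of $A$ is the transpose of the matrix of cofactors
of $A$. By construction of the adjoint, we know that

$$AA_\# = A_\#A = \operatorname{diag}(|A|, |A|, \cdots |A|) = |A| I_m. \tag{10.24}$$

In order to see this identity, note that

$$\begin{aligned} a_{i\cdot}b_{\cdot i} &= |A| \text{ where } B = A_\# \\ a_{i\cdot}b_{\cdot j} &= 0 \text{ where } B = A_\#,\ i \ne j. \end{aligned} \tag{10.25}$$

Focusing on the first point,

$$\begin{aligned} \left[AA_\#\right]_{11} &= \begin{bmatrix} a_{11} & a_{12} & a_{13} \end{bmatrix}\begin{bmatrix} (-1)^{1+1}\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} \\ (-1)^{1+2}\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} \\ (-1)^{1+3}\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix} \end{bmatrix} \\ &= (-1)^{1+1}a_{11}\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} + (-1)^{1+2}a_{12}\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} + (-1)^{1+3}a_{13}\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix} = |A|. \end{aligned} \tag{10.26}$$

Given this expression, we see that

$$A^{-1} = |A|^{-1}A_\#. \tag{10.27}$$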


A more applied view of the inverse involves row operations. For example, 
suppose we are interested in finding the inverse of a matrix 


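$$A = \begin{bmatrix} 1 & 9 & 5 \\ 3 & 7 & 8 \\ 2 & 3 & 5 \end{bmatrix}. \tag{10.28}$$

As a first step, we form an augmented solution matrix with the matrix we
want on the left-hand side and an identity on the right-hand side, as depicted
in Equation 10.29.

$$\left[\begin{array}{ccc|ccc} 1 & 9 & 5 & 1 & 0 & 0 \\ 3 & 7 & 8 & 0 & 1 & 0 \\ 2 & 3 & 5 & 0 & 0 & 1 \end{array}\right]. \tag{10.29}$$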


Next, we want to derive a sequence of elementary matrix operations to
transform the left-hand matrix into an identity. These matrix operations will leave
the inverse matrix in the right-hand side. From the matrix representation in
Equation 10.29, the first series of operations is to subtract 3 times the first
row from the second row and 2 times the first row from the third row. The
elementary row operation to accomplish this transformation is

$$\begin{bmatrix} 1 & 0 & 0 \\ -3 & 1 & 0 \\ -2 & 0 & 1 \end{bmatrix}. \tag{10.30}$$

Multiplying Equation 10.29 by Equation 10.30 yields

$$\left[\begin{array}{ccc|ccc} 1 & 9 & 5 & 1 & 0 & 0 \\ 0 & -20 & -7 & -3 & 1 & 0 \\ 0 & -15 & -5 & -2 & 0 & 1 \end{array}\right] \tag{10.31}$$
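which implies the next elementary transformation matrix

$$\begin{bmatrix} 1 & 9/20 & 0 \\ 0 & -1/20 & 0 \\ 0 & -15/20 & 1 \end{bmatrix}. \tag{10.32}$$

Multiplying Equation 10.31 by Equation 10.32 yields

$$\left[\begin{array}{ccc|ccc} 1 & 0 & 37/20 & -7/20 & 9/20 & 0 \\ 0 & 1 & 7/20 & 3/20 & -1/20 & 0 \\ 0 & 0 & 1/4 & 1/4 & -3/4 & 1 \end{array}\right] \tag{10.33}$$

which implies the final elementary operation matrix

$$\begin{bmatrix} 1 & 0 & -37/5 \\ 0 & 1 & -7/5 \\ 0 & 0 & 4 \end{bmatrix}. \tag{10.34}$$

The final result is then

$$\left[\begin{array}{ccc|ccc} 1 & 0 & 0 & -11/5 & 6 & -37/5 \\ 0 & 1 & 0 & -1/5 & 1 & -7/5 \\ 0 & 0 & 1 & 1 & -3 & 4 \end{array}\right] \tag{10.35}$$

so the inverse matrix becomes

$$A^{-1} = \begin{bmatrix} -11/5 & 6 & -37/5 \\ -1/5 & 1 & -7/5 \\ 1 & -3 & 4 \end{bmatrix}. \tag{10.36}$$

Checking this result,

$$\begin{bmatrix} 1 & 9 & 5 \\ 3 & 7 & 8 \\ 2 & 3 & 5 \end{bmatrix}\begin{bmatrix} -11/5 & 6 & -37/5 \\ -1/5 & 1 & -7/5 \\ 1 & -3 & 4 \end{bmatrix} = \begin{bmatrix} -11/5 & 6 & -37/5 \\ -1/5 & 1 & -7/5 \\ 1 & -3 & 4 \end{bmatrix}\begin{bmatrix} 1 & 9 & 5 \\ 3 & 7 & 8 \\ 2 & 3 & 5 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}. \tag{10.37}$$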

Rank of a Matrix 


The rank of a matrix is the number of linearly independent rows or columns. 
One way to determine the rank of any general matrix m x n is to delete rows 
or columns until the resulting r x r matrix has a nonzero determinant. What 
is the rank of the above matrix? (Because its determinant is $-5 \ne 0$, the rank
is 3.) If the above matrix had been

$$A = \begin{bmatrix} 1 & 9 & 5 \\ 3 & 7 & 8 \\ 4 & 16 & 13 \end{bmatrix} \tag{10.38}$$

note $|A| = 0$. Thus, to determine the rank, we delete the last row and column,
leaving

$$A_1 = \begin{bmatrix} 1 & 9 \\ 3 & 7 \end{bmatrix} \Rightarrow |A_1| = 7 - 27 = -20. \tag{10.39}$$

The rank of the matrix is 2.

The rank of a matrix A remains unchanged by any of the following opera- 
tions, called elementary transformations: (a) the interchange of two rows (or 
columns) of A, (b) the multiplication of a row (or column) of A by a nonzero 
scalar, and (c) the addition of a scalar multiple of a row (or column) of A to 
another row (or column) of A. 

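For example, we can derive the rank of the matrix in Equation 10.38 using
a series of elementary matrices. As a starting point, consider the first elementary
matrix used to construct the inverse as discussed above.

$$\begin{bmatrix} 1 & 0 & 0 \\ -3 & 1 & 0 \\ -4 & 0 & 1 \end{bmatrix}\begin{bmatrix} 1 & 9 & 5 \\ 3 & 7 & 8 \\ 4 & 16 & 13 \end{bmatrix} = \begin{bmatrix} 1 & 9 & 5 \\ 0 & -20 & -7 \\ 0 & -20 & -7 \end{bmatrix}. \tag{10.40}$$

It is obvious in Equation 10.40 that the third row is equal to the second row
so that

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix}\begin{bmatrix} 1 & 9 & 5 \\ 0 & -20 & -7 \\ 0 & -20 & -7 \end{bmatrix} = \begin{bmatrix} 1 & 9 & 5 \\ 0 & -20 & -7 \\ 0 & 0 & 0 \end{bmatrix}. \tag{10.41}$$

Hence, the rank of the original matrix is 2 (there are two rows with leading
nonzero elements in the reduced matrix).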
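The same reduction can be automated. The following sketch counts the pivot rows left after forward elimination, again with exact rational arithmetic; the function name `rank` is an illustrative choice rather than a library routine.

```python
from fractions import Fraction

def rank(matrix):
    """Rank of a matrix, counted as the number of pivots in row echelon form."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    pivots = 0
    for col in range(n_cols):
        # look for a nonzero entry in this column among the unused rows
        pivot = next((r for r in range(pivots, n_rows) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[pivots], rows[pivot] = rows[pivot], rows[pivots]
        for r in range(pivots + 1, n_rows):
            factor = rows[r][col] / rows[pivots][col]
            rows[r] = [v - factor * w for v, w in zip(rows[r], rows[pivots])]
        pivots += 1
        if pivots == n_rows:
            break
    return pivots

print(rank([[1, 9, 5], [3, 7, 8], [4, 16, 13]]))   # matrix in Equation 10.38
print(rank([[1, 9, 5], [3, 7, 8], [2, 3, 5]]))     # matrix in Equation 10.28
```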


Orthogonal Matrices 


An $m \times 1$ vector $p$ is said to be a normalized vector or a unit vector if $p'p = 1$.
The $m \times 1$ vectors $p_1, p_2, \cdots p_n$ where $n$ is less than or equal to $m$ are said to be
orthogonal if $p_i'p_j = 0$ for all $i$ not equal to $j$. If a group of $n$ orthogonal vectors
is also normalized, the vectors are said to be orthonormal. An $m \times m$ matrix
consisting of orthonormal vectors is said to be orthogonal. It then follows

$$P'P = I. \tag{10.42}$$

It is possible to show that the determinant of an orthogonal matrix is either
1 or $-1$.


Quadratic Forms

In general, a quadratic form of a matrix can be written as

$$x'Ay = \sum_{i=1}^{m}\sum_{j=1}^{m} x_i y_j a_{ij}. \tag{10.43}$$

We are most often interested in the quadratic form $x'Ax$.
Every symmetric matrix $A$ can be classified into one of five categories:


a) If $x'Ax > 0$ for all $x \ne 0$, then $A$ is positive definite.

b) If $x'Ax \ge 0$ for all $x \ne 0$ and $x'Ax = 0$ for some $x \ne 0$, then $A$ is positive
semidefinite.

c) If $x'Ax < 0$ for all $x \ne 0$, then $A$ is negative definite.

d) If $x'Ax \le 0$ for all $x \ne 0$ and $x'Ax = 0$ for some $x \ne 0$, then $A$ is negative
semidefinite.

e) If $x'Ax > 0$ for some $x$ and $x'Ax < 0$ for some $x$, then $A$ is indefinite.


Positive and negative definiteness have a wide array of applications in
econometrics. By its very definition, a variance matrix is positive definite.

$$E\left[(x - \bar{x})(x - \bar{x})'\right] = \Sigma_{xx} \tag{10.44}$$

where $x$ is a vector of random variables, $\bar{x}$ is the vector of means for those
random variables, and $\Sigma_{xx}$ is the variance matrix. Typically, we write the
sample as


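$$Z = \begin{bmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1r} - \bar{x}_r \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2r} - \bar{x}_r \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} - \bar{x}_1 & x_{N2} - \bar{x}_2 & \cdots & x_{Nr} - \bar{x}_r \end{bmatrix}. \tag{10.45}$$

Therefore $Z'Z/N = S_{xx}$ is an $r \times r$ matrix. Further, we know that the matrix
is at least positive semidefinite because

$$x'\left(Z'Z\right)x = (Zx)'(Zx) \ge 0. \tag{10.46}$$

Essentially, Equation 10.46 is the matrix equivalent to squaring a scalar
number.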


10.1.2 Vector Spaces 


To develop the concept of a vector space, consider a simple linear system: 


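$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 5 & 3 & 2 \\ 4 & 1 & 6 \\ 9 & 4 & 8 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}. \tag{10.47}$$

Consider a slight reformulation of Equation 10.47:

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 5 \\ 4 \\ 9 \end{bmatrix}x_1 + \begin{bmatrix} 3 \\ 1 \\ 4 \end{bmatrix}x_2 + \begin{bmatrix} 2 \\ 6 \\ 8 \end{bmatrix}x_3. \tag{10.48}$$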


FIGURE 10.1
Vector Space.

This formulation is presented graphically in Figure 10.1. The three vectors in 
Equation 10.48 represent three points in a three-dimensional space. The ques- 
tion is what space does this set of three points span — can we explain any point 
in this three-dimensional space using a combination of these three points? 


Definition 10.6. Let $S$ be a collection of $m \times 1$ vectors satisfying the
following: (1) if $x_1 \in S$ and $x_2 \in S$, then $x_1 + x_2 \in S$ and (2) if $x \in S$ and $\alpha$ is a real
scalar then $\alpha x \in S$. Then $S$ is called a vector space in $m$-dimensional space.
If $S$ is a subset of $T$, which is another vector space in $m$-dimensional space,
$S$ is called a vector subspace of $T$.


Definition 10.7. Let $\{x_1, \cdots x_n\}$ be a set of $m \times 1$ vectors in the vector space
$S$. If each vector in $S$ can be expressed as a linear combination of the vectors
$x_1, \cdots x_n$, then the set $\{x_1, \cdots x_n\}$ is said to span or generate the vector space
$S$, and $\{x_1, \cdots x_n\}$ is called a spanning set of $S$.


Linear Independence and Dependence 


At most the set of vectors presented in Equation 10.48 can span a three- 
dimensional space. However, it is possible that the set of vectors may only 
span a two-dimensional space. In fact, the vectors in Equation 10.48 only 
span a two-dimensional space — one of the dimensions is linearly dependent. 
If all the vectors are linearly independent, then the vectors would span a 
three-dimensional space. Taking it further, it is even possible that all three 
points lie on the same ray from the origin, so the set of vectors only spans 
a one-dimensional space. More formally, we state the conditions for linear 
independence in Definition 10.8. 
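Definition 10.8. The set of $m \times 1$ vectors $\{x_1, \cdots x_n\}$ is said to be linearly
independent if the only solution to the equation

$$\sum_{i=1}^{n} \alpha_i x_i = 0 \tag{10.49}$$

is the zero vector $\alpha_1 = \cdots = \alpha_n = 0$.


Going back to the matrix in Equations 10.40 and 10.41,

$$\begin{bmatrix} 1 & 9 & 5 \\ 3 & 7 & 8 \\ 4 & 16 & 13 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & 0 & 37/20 \\ 0 & 1 & 7/20 \\ 0 & 0 & 0 \end{bmatrix}. \tag{10.50}$$

This reduction implies that

$$\frac{37}{20}\begin{bmatrix} 1 \\ 3 \\ 4 \end{bmatrix} + \frac{7}{20}\begin{bmatrix} 9 \\ 7 \\ 16 \end{bmatrix} = \begin{bmatrix} 5 \\ 8 \\ 13 \end{bmatrix} \tag{10.51}$$

or that the third column of the matrix is a linear combination of the first two.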
Orthonormal Bases and Projections 


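Assume that a set of vectors $\{x_1, \cdots x_r\}$ forms a basis for some space $S$ in $R^m$
space such that $r \le m$. For mathematical simplicity, we may want to form
an orthogonal basis for this space. One way to form such a basis is the
Gram-Schmidt orthonormalization. In this procedure, we want to generate a new set
of vectors $\{y_1, \cdots y_r\}$ that are orthogonal. The Gram-Schmidt process is

$$\begin{aligned} y_1 &= x_1 \\ y_2 &= x_2 - \frac{x_2'y_1}{y_1'y_1}y_1 \\ y_3 &= x_3 - \frac{x_3'y_1}{y_1'y_1}y_1 - \frac{x_3'y_2}{y_2'y_2}y_2 \end{aligned} \tag{10.52}$$

which produces a set of orthogonal vectors. Then the set of vectors $z_i$ defined
as

$$z_i = \frac{y_i}{\sqrt{y_i'y_i}} \tag{10.53}$$

is orthonormal and, in our example, spans a plane in three-dimensional space.
Setting $y_1 = x_1$ (the first column of the matrix in Equation 10.50), $y_2$ is derived as

$$y_2 = \begin{bmatrix} 9 \\ 7 \\ 16 \end{bmatrix} - \frac{\begin{bmatrix} 9 & 7 & 16 \end{bmatrix}\begin{bmatrix} 1 \\ 3 \\ 4 \end{bmatrix}}{\begin{bmatrix} 1 & 3 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 3 \\ 4 \end{bmatrix}}\begin{bmatrix} 1 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 70/13 \\ -50/13 \\ 20/13 \end{bmatrix}. \tag{10.54}$$

The vectors can then be normalized to one. However, to test for orthogonality,

$$\begin{bmatrix} 1 & 3 & 4 \end{bmatrix}\begin{bmatrix} 70/13 \\ -50/13 \\ 20/13 \end{bmatrix} = 0. \tag{10.55}$$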
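The Gram-Schmidt steps can be sketched as a short routine. The version below returns the orthogonal (not yet normalized) vectors and drops any vector that is a linear combination of earlier ones, using exact rational arithmetic so that the result can be compared with the fractions 70/13, -50/13, and 20/13. The function names `dot` and `gram_schmidt` are illustrative, and the input vectors are the columns (1, 3, 4), (9, 7, 16), and (5, 8, 13).

```python
from fractions import Fraction

def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors):
    """Orthogonal vectors spanning the same space as the inputs."""
    basis = []
    for x in vectors:
        y = [Fraction(v) for v in x]
        for b in basis:
            # subtract the projection of x onto each earlier orthogonal vector
            coef = dot(x, b) / dot(b, b)
            y = [yi - coef * bi for yi, bi in zip(y, b)]
        if any(yi != 0 for yi in y):      # skip linearly dependent vectors
            basis.append(y)
    return basis

ys = gram_schmidt([[1, 3, 4], [9, 7, 16], [5, 8, 13]])
for y in ys:
    print([str(v) for v in y])
```

Because the third column is a linear combination of the first two, only two orthogonal vectors are returned, consistent with a two-dimensional spanned space.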
Theorem 10.9. Every $r$-dimensional vector space, except the zero-dimensional
space $\{0\}$, has an orthonormal basis.


Theorem 10.10. Let $\{z_1, \cdots z_r\}$ be an orthonormal basis for some vector space
$S$, of $R^m$. Then each $x \in R^m$ can be expressed uniquely as

$$x = u + v \tag{10.56}$$

where $u \in S$ and $v$ is a vector that is orthogonal to every vector in $S$.


Definition 10.11. Let $S$ be a vector subspace of $R^m$. The orthogonal
complement of $S$, denoted $S^\perp$, is the collection of all vectors in $R^m$ that are
orthogonal to every vector in $S$: that is, $S^\perp = \{x : x \in R^m \text{ and } x'y = 0, \forall y \in S\}$.


Theorem 10.12. If $S$ is a vector subspace of $R^m$, then its orthogonal
complement $S^\perp$ is also a vector subspace of $R^m$.


10.2. Projection Matrices 


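The orthogonal projection of an $m \times 1$ vector $x$ onto a vector space $S$ can be
expressed in matrix form. Let $\{z_1, \cdots z_r\}$ be any orthonormal basis for $S$ while
$\{z_1, \cdots z_m\}$ is an orthonormal basis for $R^m$. Any vector $x$ can be written as

$$x = (\alpha_1 z_1 + \cdots + \alpha_r z_r) + (\alpha_{r+1} z_{r+1} + \cdots + \alpha_m z_m) = u + v. \tag{10.57}$$

Aggregating $\alpha = (\alpha_1', \alpha_2')'$ where $\alpha_1 = (\alpha_1 \cdots \alpha_r)'$ and $\alpha_2 = (\alpha_{r+1} \cdots \alpha_m)'$
and assuming a similar decomposition of $Z = [Z_1\ Z_2]$, the vector $x$ can be
written as

$$\begin{aligned} x &= Z\alpha = Z_1\alpha_1 + Z_2\alpha_2 \\ u &= Z_1\alpha_1 \\ v &= Z_2\alpha_2. \end{aligned} \tag{10.58}$$

Given orthogonality, we know that $Z_1'Z_1 = I_r$ and $Z_1'Z_2 = [0]$, and so

$$Z_1Z_1'x = Z_1Z_1'\begin{bmatrix} Z_1 & Z_2 \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} = \begin{bmatrix} Z_1 & 0 \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} = Z_1\alpha_1 = u. \tag{10.59}$$

Theorem 10.13. Suppose the columns of the $m \times r$ matrix $Z_1$ form an
orthonormal basis for the vector space $S$, which is a subspace of $R^m$. If $x \in R^m$,
the orthogonal projection of $x$ onto $S$ is given by $Z_1Z_1'x$.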
Projection matrices allow the division of the space into a spanned space
and a set of orthogonal deviations from the spanning set. One such separation
involves the Gram-Schmidt system. In general, if we define the $m \times r$ matrix
$X_1 = (x_1, \cdots x_r)$ and define the linear transformation of this matrix that
produces an orthonormal basis as $A$,

$$Z_1 = X_1A. \tag{10.60}$$

We are left with the result that

$$Z_1'Z_1 = A'X_1'X_1A = I_r. \tag{10.61}$$

Given that the matrix $A$ is nonsingular, the projection matrix that maps any
vector $x$ onto the spanning set then becomes

$$P_S = Z_1Z_1' = X_1AA'X_1' = X_1\left(X_1'X_1\right)^{-1}X_1'. \tag{10.62}$$

Ordinary least squares is also a spanning decomposition. In the traditional
linear model

$$\begin{aligned} y &= X\beta + \epsilon \\ \hat{y} &= X\hat{\beta} \end{aligned} \tag{10.63}$$

$\beta$ is chosen to minimize the squared error between $y$ and the estimated $\hat{y}$,

$$(y - X\beta)'(y - X\beta). \tag{10.64}$$

This problem implies minimizing the distance between the observed $y$ and the
predicted plane $X\beta$, which implies orthogonality. If $X$ has full column rank,
the projection matrix becomes $X\left(X'X\right)^{-1}X'$ and the projection then becomes

$$X\hat{\beta} = X\left(X'X\right)^{-1}X'y. \tag{10.65}$$

Premultiplying each side by $X'$ yields

$$\begin{aligned} X'X\hat{\beta} &= X'X\left(X'X\right)^{-1}X'y \\ \left(X'X\right)^{-1}X'X\hat{\beta} &= \left(X'X\right)^{-1}X'X\left(X'X\right)^{-1}X'y \\ \hat{\beta} &= \left(X'X\right)^{-1}X'y. \end{aligned} \tag{10.66}$$

Essentially, the projection matrix is defined as that spanning space where the
unexplained factors are orthogonal to the space $\alpha_1 z_1 + \cdots + \alpha_r z_r$. The spanning
space defined by Equation 10.65 is identical to the definition of the spanning
space in Equation 10.57. Hence, we could justify ordinary least squares as
its name implies as that set of coefficients that minimizes the sum squared
error, or as that space such that the residuals are orthogonal to the predicted
values. These spanning spaces will also become important in the construction
of instrumental variables in Chapter 11.
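The orthogonality interpretation can be verified on a small example. The sketch below computes $\hat{\beta} = (X'X)^{-1}X'y$ for a regression with an intercept and one regressor, then checks that the residuals are orthogonal to both the columns of $X$ and the fitted values. The data are hypothetical, and the helper names (`transpose`, `matmul`, `inverse_2x2`) are illustrative; exact rational arithmetic is used so the orthogonality conditions hold exactly rather than up to rounding.

```python
from fractions import Fraction

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

# hypothetical data: a column of ones and one regressor
X = [[Fraction(1), Fraction(x)] for x in (1, 2, 3, 4)]
y = [[Fraction(v)] for v in (2, 3, 5, 4)]

Xt = transpose(X)
beta = matmul(inverse_2x2(matmul(Xt, X)), matmul(Xt, y))   # (X'X)^{-1} X'y
fitted = matmul(X, beta)                                   # projection of y onto X
resid = [[yi[0] - fi[0]] for yi, fi in zip(y, fitted)]

print([str(b[0]) for b in beta])
print(matmul(Xt, resid))                  # X'e, zero under orthogonality
print(matmul(transpose(resid), fitted))   # e'y-hat, also zero
```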




10.3. Idempotent Matrices 


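Idempotent matrices can be defined as any matrix such that $AA = A$. Note
that the sum of square errors ($SSE$) under the projection can be expressed
as

$$\begin{aligned} SSE &= \left(y - X\hat{\beta}\right)'\left(y - X\hat{\beta}\right) \\ &= \left(y - X\left(X'X\right)^{-1}X'y\right)'\left(y - X\left(X'X\right)^{-1}X'y\right) \\ &= \left[\left(I_n - X\left(X'X\right)^{-1}X'\right)y\right]'\left[\left(I_n - X\left(X'X\right)^{-1}X'\right)y\right] \\ &= y'\left(I_n - X\left(X'X\right)^{-1}X'\right)\left(I_n - X\left(X'X\right)^{-1}X'\right)y. \end{aligned} \tag{10.67}$$

The matrix $I_n - X\left(X'X\right)^{-1}X'$ is an idempotent matrix.

$$\begin{aligned} &\left(I_n - X\left(X'X\right)^{-1}X'\right)\left(I_n - X\left(X'X\right)^{-1}X'\right) \\ &\quad = I_n - X\left(X'X\right)^{-1}X' - X\left(X'X\right)^{-1}X' + X\left(X'X\right)^{-1}X'X\left(X'X\right)^{-1}X' \\ &\quad = I_n - X\left(X'X\right)^{-1}X'. \end{aligned} \tag{10.68}$$

Thus, the $SSE$ can be expressed as

$$\begin{aligned} SSE &= y'\left(I_n - X\left(X'X\right)^{-1}X'\right)y \\ &= y'y - y'X\left(X'X\right)^{-1}X'y = v'v \end{aligned} \tag{10.69}$$

which is the sum of the squared orthogonal errors from the regression.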
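A brief numerical check of idempotency follows. The sketch builds $M = I_n - X(X'X)^{-1}X'$ for a hypothetical four-observation regression with an intercept and one regressor, confirms that $MM = M$, and computes the sum of squared errors as $y'My$. The helper names are illustrative and the arithmetic is exact.

```python
from fractions import Fraction

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

# hypothetical data: a column of ones and one regressor
X = [[Fraction(1), Fraction(x)] for x in (1, 2, 3, 4)]
y = [[Fraction(v)] for v in (2, 3, 5, 4)]
n = len(X)

Xt = transpose(X)
P = matmul(matmul(X, inverse_2x2(matmul(Xt, X))), Xt)    # projection matrix
M = [[Fraction(int(i == j)) - P[i][j] for j in range(n)] for i in range(n)]

sse = matmul(matmul(transpose(y), M), y)[0][0]            # y'My
print(matmul(M, M) == M, sse)
```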


10.4. Eigenvalues and Eigenvectors 


Eigenvalues and eigenvectors (or more appropriately latent roots and charac- 
teristic vectors) are defined by the solution 


Ax = \x (10.70) 


for a nonzero x. Mathematically, we can solve for the eigenvalue by rearranging 


the terms 
Ar —rA\{cx = 0 
(10.71) 
(A-—AI)a =0. 
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Solving for \ then involves solving the characteristic equation that is implied 
by 
|A— AI] = 0. (10.72) 


Again using the matrix in the previous example, 


1 9 5 1 0 0 1—A 9 5 
3.7 8 | -A]O 1 OfJ= 3 7—A 8 
2 3 °5 0 0 1 2 3 5—A (10.73) 


= —5 +14. + 132 — 2° =0. 


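In general, there are $m$ roots to the characteristic equation of an $m \times m$ matrix. Some of these roots may be the same. In the above case, the roots are real but irrational (approximately -1.26, 0.28, and 13.98). Turning to another example,

\[
A = \begin{bmatrix} 5 & -3 & 3 \\ 4 & -2 & 3 \\ 4 & -4 & 5 \end{bmatrix}
\Rightarrow \lambda = \{1, 2, 5\}.
\tag{10.74}
\]

The eigenvectors are then determined by the linear dependence in the $A - \lambda I$ matrix. Taking the last example (with $\lambda = 1$),

\[
[A - \lambda I] = \begin{bmatrix} 4 & -3 & 3 \\ 4 & -3 & 3 \\ 4 & -4 & 4 \end{bmatrix}.
\tag{10.75}
\]

The first and second rows are linearly dependent (in fact identical). The reduced system then implies that as long as $x_1 = 0$ and $x_2 = x_3$, the product $(A - \lambda I)x$ is zero.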
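
The eigenvalues in Equation 10.74 can be verified by direct substitution. This is a minimal sketch: the first row of $A$ is taken to be $(5, -3, 3)$, as implied by Equation 10.75 with $\lambda = 1$, and the helper names are hypothetical. It evaluates $|A - \lambda I|$ at each root and checks that $x = (0, 1, 1)'$, which solves the system in Equation 10.75, satisfies $Ax = x$.

```python
# Check the eigenvalues {1, 2, 5} of the matrix in Equation 10.74 by
# evaluating det(A - lambda I), then confirm an eigenvector for lambda = 1.

def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def shifted(m, lam):
    # Returns A - lambda I for a 3 x 3 matrix.
    return [[m[r][c] - (lam if r == c else 0) for c in range(3)] for r in range(3)]

A = [[5, -3, 3], [4, -2, 3], [4, -4, 5]]

dets = {lam: det3(shifted(A, lam)) for lam in (1, 2, 5)}

x = [0, 1, 1]  # satisfies x1 = 0 and x2 = x3
Ax = [sum(A[r][c] * x[c] for c in range(3)) for r in range(3)]

print(dets)
print(Ax)
```

As a further consistency check, the trace of $A$ (8) equals the sum of the roots and the determinant (10) equals their product.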


Theorem 10.14. For any symmetric matrix $A$ there exists an orthogonal matrix $H$ (that is, a square matrix satisfying $H'H = I$) such that

\[
H'AH = \Lambda \tag{10.76}
\]

where $\Lambda$ is a diagonal matrix. The diagonal elements of $\Lambda$ are called the characteristic roots (or eigenvalues) of $A$. The $i$th column of $H$ is called the characteristic vector (or eigenvector) of $A$ corresponding to the $i$th characteristic root of $A$.


This proof follows directly from the definition of eigenvalues. Letting $H$ be a matrix with the eigenvectors in its columns, it is obvious that

\[
AH = H\Lambda \tag{10.77}
\]

by our original discussion of eigenvalues and eigenvectors. In addition, the eigenvectors of a symmetric matrix are orthogonal and can be normalized to unit length, so $H$ is an orthogonal matrix. Thus,

\[
H'AH = H'H\Lambda = \Lambda. \tag{10.78}
\]
One useful application of eigenvalues and eigenvectors is the fact that
the eigenvalues and eigenvectors of a real symmetric matrix are also real. 
Further, if all the eigenvalues are positive, the matrix is positive definite. 
Alternatively, if all the eigenvalues are negative, the matrix is negative definite. 
This is particularly useful for econometrics because most of our matrices are 
symmetric. For example, the sample variance matrix is symmetric and positive 
definite. 


10.5 Kronecker Products 


Two special matrix operations that you will encounter are the Kronecker product and vec(.) operators. The Kronecker product multiplies each element of the first matrix by the entire second matrix

\[
A \otimes B =
\begin{bmatrix}
a_{11}B & a_{12}B & \cdots & a_{1n}B \\
a_{21}B & a_{22}B & \cdots & a_{2n}B \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1}B & a_{m2}B & \cdots & a_{mn}B
\end{bmatrix}.
\tag{10.79}
\]

The vec(.) operator then involves stacking the columns of a matrix on top of one another.
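
Both operators are short to code. The sketch below is illustrative only: `kron` and `vec` are hypothetical helper names written for this example (they are not functions from any package discussed in the text), and the matrices are arbitrary.

```python
# Minimal list-of-lists versions of the Kronecker product and vec operator.

def kron(a, b):
    """Kronecker product: each a[i][j] scales a full copy of b."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]

def vec(a):
    """Stack the columns of a matrix into one long column (as a flat list)."""
    return [a[i][j] for j in range(len(a[0])) for i in range(len(a))]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]

print(kron(A, B))  # [[0, 1, 0, 2], [1, 0, 2, 0], [0, 3, 0, 4], [3, 0, 4, 0]]
print(vec(A))      # [1, 3, 2, 4]
```

If $A$ is $m \times n$ and $B$ is $p \times q$, the product is $mp \times nq$; here two $2 \times 2$ matrices give a $4 \times 4$ result.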

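The Kronecker product and vec(.) operators appear somewhat abstract, but are useful in certain specifications of systems of equations. For example, suppose that we want to estimate a system of two equations,

\[
\begin{aligned}
y_1 &= \alpha_{10} + \alpha_{11}x_1 + \alpha_{12}x_2 \\
y_2 &= \alpha_{20} + \alpha_{21}x_1 + \alpha_{22}x_2.
\end{aligned}
\tag{10.80}
\]

Assuming a small sample of three observations, we could express the system of equations in a sample as

\[
\begin{bmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \\ y_{31} & y_{32} \end{bmatrix}
=
\begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \end{bmatrix}
\begin{bmatrix} \alpha_{10} & \alpha_{20} \\ \alpha_{11} & \alpha_{21} \\ \alpha_{12} & \alpha_{22} \end{bmatrix}.
\tag{10.81}
\]

As will be developed more fully in Chapter 11, we can estimate both equations at once if we rearrange the system in Equation 10.81 as

\[
\begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ y_{31} \\ y_{32} \end{bmatrix}
=
\begin{bmatrix}
1 & x_{11} & x_{12} & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & x_{11} & x_{12} \\
1 & x_{21} & x_{22} & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & x_{21} & x_{22} \\
1 & x_{31} & x_{32} & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & x_{31} & x_{32}
\end{bmatrix}
\begin{bmatrix} \alpha_{10} \\ \alpha_{11} \\ \alpha_{12} \\ \alpha_{20} \\ \alpha_{21} \\ \alpha_{22} \end{bmatrix}.
\tag{10.82}
\]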
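Taking each operation in turn, the first operation is usually written as vecr(y), or making a vector by rows

\[
\mathrm{vecr}
\begin{bmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \\ y_{31} & y_{32} \end{bmatrix}
=
\begin{bmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ y_{31} \\ y_{32} \end{bmatrix}
\tag{10.83}
\]

(this is equivalent to vec(y')). The next term involves the Kronecker product, applied to each row of $X$ in turn

\[
I_{2\times 2} \otimes X =
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\otimes
\begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \end{bmatrix}
=
\begin{bmatrix}
1 \times \begin{bmatrix} 1 & x_{11} & x_{12} \end{bmatrix} & 0 \times \begin{bmatrix} 1 & x_{11} & x_{12} \end{bmatrix} \\
0 \times \begin{bmatrix} 1 & x_{11} & x_{12} \end{bmatrix} & 1 \times \begin{bmatrix} 1 & x_{11} & x_{12} \end{bmatrix} \\
\vdots & \vdots
\end{bmatrix}.
\tag{10.84}
\]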
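Following the operations through gives the $X$ matrix in Equation 10.82. Completing the formulation, vec($\alpha$) is the standard vectorization

\[
\mathrm{vec}
\begin{bmatrix} \alpha_{10} & \alpha_{20} \\ \alpha_{11} & \alpha_{21} \\ \alpha_{12} & \alpha_{22} \end{bmatrix}
=
\begin{bmatrix} \alpha_{10} \\ \alpha_{11} \\ \alpha_{12} \\ \alpha_{20} \\ \alpha_{21} \\ \alpha_{22} \end{bmatrix}.
\tag{10.85}
\]

Thus, while we may not refer to it as simplified, Equation 10.81 can be written as

\[
\mathrm{vecr}(y) = \left[I_{2\times 2} \otimes X\right] \mathrm{vec}(\alpha). \tag{10.86}
\]

At least it makes for simpler computer coding.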
10.6 Chapter Summary


• The matrix operations developed in this chapter are used in ordinary least squares and the other linear econometric models presented in Chapter 11.

• Matrix algebra allows us to specify multivariate regression equations and solve these equations relatively efficiently.

• Apart from solution techniques, matrix algebra has implications for spanning spaces, the regions that can be explained by a set of vectors.

  – Spanning spaces are related to the standard ordinary least squares estimator: the mechanics of the ordinary least squares estimator guarantee that the residuals are orthogonal to the estimated regression relationship.

  – The concept of an orthogonal projection will be used in instrumental variables techniques in Chapter 11.


10.7 Review Questions 
10-1R. If $\mathrm{tr}[A] = 13$, then what is $\mathrm{tr}[A + k \times I_{m \times m}]$?


10.8 Numerical Exercises 


10-1E. Starting with the matrix 


5 
A=] 3 (10.87) 
2 


Re w 
NB DS 


compute the determinant of matrix A. 


10-2E. Compute $B = A + 2 \times I_{3\times 3}$ where $I_{3\times 3}$ is the identity matrix (i.e., a matrix of zeros with a diagonal of 1).


10-3E. Compute the inverse of B = A+ 2 x I3x3 using row operations. 


10-4E. Compute the eigenvalues of B. Is B negative definite, positive 
definite, or indefinite? 
10-5E. Demonstrate that AAy = |A| J. 


10-6E. Compute the inverse of A using the cofactor matrix (i.e., the fact 
that AAy =|A|J). Hint — remember that A is symmetric. 


10-7E. As a starting point for our discussion, consider the linear model

\[
r_t = \alpha_0 + \alpha_1 R_t + \alpha_2 \Delta(D/A)_t + \epsilon_t \tag{10.88}
\]

where $r_t$ is the interest rate paid by Florida farmers, $R_t$ is the Baa Corporate bond rate, and $\Delta(D/A)_t$ is the change in the debt to asset ratio for Florida agriculture. For our discussion, we assume

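\[
\begin{bmatrix} r_t \\ 1 \\ R_t \\ \Delta(D/A)_t \end{bmatrix}
\sim N\left(
\begin{bmatrix} 0.07315 \\ 1.00000 \\ 0.08534 \\ 0.00904 \end{bmatrix},
\begin{bmatrix}
0.00559 & 0.07315 & 0.00655 & 0.00045 \\
0.07315 & 1.00000 & 0.08535 & 0.00905 \\
0.00655 & 0.08535 & 0.00793 & 0.00063 \\
0.00045 & 0.00905 & 0.00063 & 0.00323
\end{bmatrix}
\right).
\tag{10.89}
\]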
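– Defining

\[
\Sigma_{XX} = \begin{bmatrix}
1.00000 & 0.08535 & 0.00905 \\
0.08535 & 0.00793 & 0.00063 \\
0.00905 & 0.00063 & 0.00323
\end{bmatrix}
\tag{10.90}
\]

and

\[
\Sigma_{XY} = \begin{bmatrix} 0.07315 \\ 0.00655 \\ 0.00045 \end{bmatrix}
\tag{10.91}
\]

compute $\beta = \Sigma_{XX}^{-1}\Sigma_{XY}$.

– Compute $\Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}$ where $\Sigma_{YY} = 0.00559$.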


11 


Regression Applications in Econometrics 


CONTENTS


11.1 Simple Linear Regression
     11.1.1 Least Squares: A Mathematical Solution
     11.1.2 Best Linear Unbiased Estimator: A Statistical Solution
     11.1.3 Conditional Normal Model
     11.1.4 Variance of the Ordinary Least Squares Estimator
11.2 Multivariate Regression
     11.2.1 Variance of Estimator
     11.2.2 Gauss-Markov Theorem
11.3 Linear Restrictions
     11.3.1 Variance of the Restricted Estimator
     11.3.2 Testing Linear Restrictions
11.4 Exceptions to Ordinary Least Squares
     11.4.1 Heteroscedasticity
     11.4.2 Two Stage Least Squares and Instrumental Variables
     11.4.3 Generalized Method of Moments Estimator
11.5 Chapter Summary
11.6 Review Questions
11.7 Numerical Exercises


The purpose of regression analysis is to explore the relationship between two variables. In this course, the relationship that we will be interested in can be expressed as

\[
y_i = \alpha + \beta x_i + \epsilon_i \tag{11.1}
\]

where $y_i$ is a random variable and $x_i$ is a variable hypothesized to affect or drive $y_i$.

(a) The coefficients $\alpha$ and $\beta$ are the intercept and slope parameters, respectively.
(b) These parameters are assumed to be fixed, but unknown.
(c) The residual $\epsilon_i$ is assumed to be an unobserved, random error.
(d) Under typical assumptions $E[\epsilon_i] = 0$.
(e) Thus, the expected value of $y_i$ given $x_i$ then becomes

\[
E[y_i] = \alpha + \beta x_i. \tag{11.2}
\]


The goal of regression analysis is to estimate $\alpha$ and $\beta$ and to say something about the significance of the relationship. From a terminology standpoint, $y$ is typically referred to as the dependent variable and $x$ is referred to as the independent variable. Casella and Berger [7] prefer the terminology of $y$ as the response variable and $x$ as the predictor variable. This relationship is a linear regression in that the relationship is linear in the parameters $\alpha$ and $\beta$. Abstracting for a moment, the traditional Cobb-Douglas production function can be written as

\[
y_i = \tilde{\alpha} x_i^{\beta}. \tag{11.3}
\]

Taking the natural logarithm of both sides yields

\[
\ln(y_i) = \ln(\tilde{\alpha}) + \beta \ln(x_i). \tag{11.4}
\]

Noting that $\ln(\tilde{\alpha}) = \alpha$, this relationship is linear in the estimated parameters and thus can be estimated using a simple linear regression.


11.1 Simple Linear Regression 


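The setup for simple linear regression is that we have a sample of $n$ pairs of variables $(x_1, y_1), \ldots, (x_n, y_n)$. Further, we want to summarize this relationship by fitting a line through the data. Based on the sample data, we first describe the data as follows:


1. The sample means

\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i. \tag{11.5}
\]

2. The sums of squares

\[
S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad
S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad
S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}). \tag{11.6}
\]

3. The most common estimators given this formulation are then given by

\[
\hat{\beta} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}. \tag{11.7}
\]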
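11.1.1 Least Squares: A Mathematical Solution


Following our theme in the discussion of linear projections, this definition involves minimizing the Residual Sum of Squares (RSS) by the choice of $\alpha$ and $\beta$.

\[
\min_{\alpha,\beta} RSS = \sum_{i=1}^{n}\left(y_i - (\alpha + \beta x_i)\right)^2. \tag{11.8}
\]

Focusing on $\alpha$ first,

\[
\frac{\partial RSS}{\partial \alpha} = -2\sum_{i=1}^{n}\left(y_i - \alpha - \beta x_i\right) = 0
\Rightarrow \alpha = \bar{y} - \beta\bar{x}. \tag{11.9}
\]

Taking the first-order condition with respect to $\beta$ and substituting this result for $\alpha$,

\[
\begin{aligned}
\frac{\partial RSS}{\partial \beta} = 0 \Rightarrow
&\sum_{i=1}^{n}\left((y_i - \beta x_i) - (\bar{y} - \beta\bar{x})\right)x_i = 0 \\
&\sum_{i=1}^{n}\left((y_i - \bar{y}) - \beta(x_i - \bar{x})\right)x_i = 0 \\
&\sum_{i=1}^{n}(y_i - \bar{y})x_i - \beta\sum_{i=1}^{n}(x_i - \bar{x})x_i = 0.
\end{aligned}
\tag{11.10}
\]

Going from this result to the traditional estimator requires the statement that

\[
\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})
= \sum_{i=1}^{n}(y_i - \bar{y})x_i - \bar{x}\sum_{i=1}^{n}(y_i - \bar{y})
= \sum_{i=1}^{n}(y_i - \bar{y})x_i
\tag{11.11}
\]

since $n\bar{y} - \sum_{i=1}^{n} y_i = 0$ by definition of $\bar{y}$. The least squares estimator of $\beta$ then becomes

\[
\hat{\beta} = \frac{S_{xy}}{S_{xx}}. \tag{11.12}
\]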
TABLE 11.1 
U.S. Consumer Total and Food Expenditures, 1984 through 2002 

Year  Total Expenditure (E)  Food Expenditure  ln(E)  Food Share of Expenditure (%)
1984 21,975 3,290 10.00 14.97 
1985 23,490 3,477 10.06 14.80 
1986 23,866 3,448 10.08 14.45 
1987 24,414 3,664 10.10 15.01 
1988 25,892 3,748 10.16 14.48 
1989 27,810 4,152 10.23 14.93 
1990 28,381 4,296 10.25 15.14 
1991 29,614 4,271 10.30 14.42 
1992 29,846 4,273 10.30 14.32 
1993 30,692 4,399 10.33 14.33 
1994 31,731 4,411 10.37 13.90 
1995 32,264 4,505 10.38 13.96 
1996 33,797 4,698 10.43 13.90 
1997 34,819 4,801 10.46 13.79 
1998 35,535 4,810 10.48 13.54 
1999 36,995 5,031 10.52 13.60 
2000 38,045 5,158 10.55 13.56 
2001 39,518 5,321 10.58 13.46 
2002 40,677 5,375 10.61 13.21 


Example 11.1 (Working's Law of Demand). Working's law of demand is an economic conjecture that the percent of the consumer's budget spent on food declines as income increases. One variant of this formulation presented in Theil, Chung, and Seale [50] is that

\[
w_{food,t} = \frac{p_{food,t}\,q_{food,t}}{E} = \alpha + \beta \ln(E), \qquad E = \sum_i p_i q_i \tag{11.13}
\]

where $w_{food,t}$ represents the consumer's budget share for food in time period $t$, and $E$ is the total level of expenditures on all consumption categories. In this representation $\beta < 0$, implying that $\alpha > 0$. Table 11.1 presents consumer income, food expenditures, the natural logarithm of consumer income, and food expenditures as a percent of total consumer expenditures for the United States for 1984 through 2002. The sample statistics for these data are

\[
\begin{array}{ll}
\bar{x} = 10.3258 & \bar{y} = 14.1984 \\
S_{xx} = 0.03460 & S_{xy} = -0.09969 \\
\hat{\beta} = -2.8812 & \hat{\alpha} = 43.9506.
\end{array}
\tag{11.14}
\]

The results of this regression are depicted in Figure 11.1. Empirically the data appear to be fairly consistent with Working's law. Theil, Chung, and Seale find that a large portion of the variation in consumption shares across countries can be explained by Working's law.
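
The estimates in Example 11.1 can be approximately reproduced from Table 11.1. The sketch below uses the ln(E) and food share columns as printed (ln(E) to four decimals, as in the regressor matrix that follows), so the results should match the reported values only up to rounding. The sums of squares here are not divided by the sample size; dividing both by the same constant leaves the slope unchanged. Variable names are illustrative.

```python
# Simple linear regression of the food share on ln(E), data from Table 11.1.

ln_E = [9.9977, 10.0643, 10.0802, 10.1029, 10.1617, 10.2332, 10.2535,
        10.2960, 10.3038, 10.3318, 10.3650, 10.3817, 10.4281, 10.4579,
        10.4783, 10.5185, 10.5465, 10.5845, 10.6134]
share = [14.97, 14.80, 14.45, 15.01, 14.48, 14.93, 15.14, 14.42, 14.32,
         14.33, 13.90, 13.96, 13.90, 13.79, 13.54, 13.60, 13.56, 13.46, 13.21]

n = len(ln_E)
x_bar = sum(ln_E) / n
y_bar = sum(share) / n

# Sums of squares and cross products about the means.
s_xx = sum((x - x_bar) ** 2 for x in ln_E)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ln_E, share))

beta_hat = s_xy / s_xx                 # slope
alpha_hat = y_bar - beta_hat * x_bar   # intercept

print(round(x_bar, 4), round(y_bar, 4))
print(round(beta_hat, 4), round(alpha_hat, 4))
```

A negative slope is what Working's law predicts: the food share falls as the logarithm of total expenditure rises.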


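In order to bring the results from matrix algebra into the discussion, we are going to use the unity regressor form and rewrite the $x$ and $y$ matrices as

\[
X = \begin{bmatrix}
1 & 9.9977 \\
1 & 10.0643 \\
1 & 10.0802 \\
1 & 10.1029 \\
1 & 10.1617 \\
1 & 10.2332 \\
1 & 10.2535 \\
1 & 10.2960 \\
1 & 10.3038 \\
1 & 10.3318 \\
1 & 10.3650 \\
1 & 10.3817 \\
1 & 10.4281 \\
1 & 10.4579 \\
1 & 10.4783 \\
1 & 10.5185 \\
1 & 10.5465 \\
1 & 10.5845 \\
1 & 10.6134
\end{bmatrix},
\qquad
y = \begin{bmatrix}
14.9700 \\
14.8000 \\
14.4500 \\
15.0100 \\
14.4800 \\
14.9300 \\
15.1400 \\
14.4200 \\
14.3200 \\
14.3300 \\
13.9000 \\
13.9600 \\
13.9000 \\
13.7900 \\
13.5400 \\
13.6000 \\
13.5600 \\
13.4600 \\
13.2100
\end{bmatrix}.
\tag{11.15}
\]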
First, we derive the projection matrix

\[
P_x = X(X'X)^{-1}X' \tag{11.16}
\]

which is a $19 \times 19$ matrix (see Section 10.2). The projection of $y$ onto the space spanned by the independent variables can then be calculated as

\[
P_x y = X(X'X)^{-1}X'y \tag{11.17}
\]

in this case a $19 \times 1$ vector. The numerical result of this projection is then

\[
P_x y = \begin{bmatrix}
15.14521 \\
14.95329 \\
14.90747 \\
14.84206 \\
14.67262 \\
14.46659 \\
14.40809 \\
14.28563 \\
14.26315 \\
14.18247 \\
14.08680 \\
14.03867 \\
13.90497 \\
13.81910 \\
13.76031 \\
13.64447 \\
13.56379 \\
13.45429 \\
13.37101
\end{bmatrix}.
\tag{11.18}
\]

Comparing these results with the estimated values of $y$ from the linear model yields

\[
\begin{aligned}
\hat{\alpha} + 9.9977\hat{\beta} &= 15.14521 \\
\hat{\alpha} + 10.0643\hat{\beta} &= 14.95329 \\
&\;\;\vdots \\
\hat{\alpha} + 10.6134\hat{\beta} &= 13.37101.
\end{aligned}
\tag{11.19}
\]
11.1.2 Best Linear Unbiased Estimator: A Statistical Solution

From Equation 11.2, the linear relationship between the $x$s and $y$s is

\[
E[y_i] = \alpha + \beta x_i \tag{11.20}
\]

and we assume that

\[
V(y_i) = \sigma^2. \tag{11.21}
\]

The implications of this variance assumption are significant. Note that we assume that each observation has the same variance regardless of the value of the independent variable. In traditional regression terms, this implies that the errors are homoscedastic.

One way to state these assumptions is

\[
\begin{aligned}
y_i &= \alpha + \beta x_i + \epsilon_i \\
E[\epsilon_i] &= 0, \quad V(\epsilon_i) = \sigma^2.
\end{aligned}
\tag{11.22}
\]

This specification is consistent with our assumptions, since the model is homoscedastic and linear in the parameters.

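Based on this formulation, we can define the linear estimators of $\alpha$ and $\beta$ as

\[
\sum_{i=1}^{n} d_i y_i. \tag{11.23}
\]

An unbiased estimator of $\beta$ can further be defined as those linear estimators whose expected value is the true value of the parameter

\[
E\left[\sum_{i=1}^{n} d_i y_i\right] = \beta. \tag{11.24}
\]

This implies that

\[
\beta = \sum_{i=1}^{n} d_i E[y_i] = \sum_{i=1}^{n} d_i(\alpha + \beta x_i)
= \alpha\sum_{i=1}^{n} d_i + \beta\sum_{i=1}^{n} d_i x_i
\Rightarrow \sum_{i=1}^{n} d_i = 0, \quad \sum_{i=1}^{n} d_i x_i = 1. \tag{11.25}
\]

The linear estimator that satisfies these unbiasedness conditions and yields the smallest variance of the estimate is referred to as the best linear unbiased estimator (or BLUE). In this example, we need to show that

\[
d_i = \frac{x_i - \bar{x}}{S_{xx}} \Rightarrow \hat{\beta} = \frac{S_{xy}}{S_{xx}} \tag{11.26}
\]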
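minimizes the variance for all such linear models. Given that the $y_i$s are uncorrelated, the variance of the linear model can be written as

\[
V\left(\sum_{i=1}^{n} d_i y_i\right) = \sum_{i=1}^{n} d_i^2 V(y_i) = \sigma^2\sum_{i=1}^{n} d_i^2. \tag{11.27}
\]

The problem of minimizing the variance then becomes choosing the $d_i$s to minimize this sum subject to the unbiasedness constraints

\[
\begin{aligned}
\min_{d_i}\; & \sigma^2\sum_{i=1}^{n} d_i^2 \\
\text{s.t. } & \sum_{i=1}^{n} d_i x_i = 1 \\
& \sum_{i=1}^{n} d_i = 0.
\end{aligned}
\tag{11.28}
\]

Transforming Equation 11.28 into a Lagrangian form,

\[
\begin{aligned}
L &= \sigma^2\sum_{i=1}^{n} d_i^2 + \lambda\left(1 - \sum_{i=1}^{n} d_i x_i\right) - \mu\sum_{i=1}^{n} d_i \\
\frac{\partial L}{\partial d_i} &= 2\sigma^2 d_i - \lambda x_i - \mu = 0 \\
\frac{\partial L}{\partial \lambda} &= 1 - \sum_{i=1}^{n} d_i x_i = 0 \\
\frac{\partial L}{\partial \mu} &= -\sum_{i=1}^{n} d_i = 0.
\end{aligned}
\tag{11.29}
\]

Using the results from the first $n$ first-order conditions and the second constraint, we have

\[
2\sigma^2\sum_{i=1}^{n} d_i - \lambda\sum_{i=1}^{n} x_i - n\mu = 0
\Rightarrow \mu = -\frac{\lambda}{n}\sum_{i=1}^{n} x_i = -\lambda\bar{x}. \tag{11.30}
\]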
Substituting this result into the first $n$ first-order conditions yields

\[
2\sigma^2 d_i - \lambda x_i + \lambda\bar{x} = 0
\Rightarrow d_i = \frac{\lambda}{2\sigma^2}(x_i - \bar{x}). \tag{11.31}
\]

Substituting these conditions into the first constraint, we get

\[
\begin{aligned}
1 - \frac{\lambda}{2\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})x_i &= 0 \\
\lambda &= \frac{2\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})x_i} \\
d_i &= \frac{x_i - \bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})x_i} = \frac{x_i - \bar{x}}{S_{xx}}.
\end{aligned}
\tag{11.32}
\]

This proves that the simple least squares estimator is BLUE on a fairly global scale. Note that we did not assume normality in this proof. The only assumptions were that the expected error term is equal to zero and that the errors are uncorrelated with a common variance.
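
The result can be illustrated numerically. In this sketch the regressor values and the trial vector are arbitrary made-up numbers. It checks that the least squares weights $d_i = (x_i - \bar{x})/S_{xx}$ satisfy both unbiasedness constraints, then builds a second set of weights that also satisfies the constraints and confirms that its sum of squares, and hence its variance $\sigma^2\sum d_i^2$, is larger.

```python
# Least squares weights versus another set of unbiased linear weights.

x = [1.0, 2.0, 4.0, 7.0, 11.0]  # hypothetical regressor values
n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

d = [(xi - x_bar) / s_xx for xi in x]  # least squares weights
sum_d = sum(d)
sum_dx = sum(di * xi for di, xi in zip(d, x))
ols_sum_sq = sum(di ** 2 for di in d)

# Any other unbiased weights can be written d + c with sum(c) = 0 and
# sum(c * x) = 0. Build such a c as the residual from regressing an
# arbitrary trial vector on a constant and x.
trial = [1.0, -2.0, 0.5, 3.0, -1.0]
t_bar = sum(trial) / n
slope = sum((xi - x_bar) * (ti - t_bar) for xi, ti in zip(x, trial)) / s_xx
c = [ti - t_bar - slope * (xi - x_bar) for xi, ti in zip(x, trial)]

alt = [di + ci for di, ci in zip(d, c)]
alt_sum = sum(alt)
alt_sum_x = sum(ai * xi for ai, xi in zip(alt, x))
alt_sum_sq = sum(ai ** 2 for ai in alt)

print(sum_d, sum_dx)
print(ols_sum_sq, alt_sum_sq)
```

Because $c$ is orthogonal to $d$, the alternative sum of squares equals $\sum d_i^2 + \sum c_i^2$, which exceeds $\sum d_i^2$ whenever $c \neq 0$.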


11.1.3. Conditional Normal Model 


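The conditional normal model assumes that the observed random variables are distributed

\[
y_i \sim N\left(\alpha + \beta x_i, \sigma^2\right). \tag{11.33}
\]

The expected value of $y$ given $x$ is $\alpha + \beta x$ and the conditional variance of $y_i$ equals $\sigma^2$. The conditional normal can be expressed as

\[
E[y_i | x_i] = \alpha + \beta x_i. \tag{11.34}
\]

Further, the $\epsilon_i$ are independently and identically distributed:

\[
\begin{aligned}
\epsilon_i &= y_i - \alpha - \beta x_i \\
\epsilon_i &\sim N\left(0, \sigma^2\right)
\end{aligned}
\tag{11.35}
\]

(consistent with our BLUE proof).

Given this formulation, the likelihood function for the simple linear model can be written

\[
L\left(\alpha, \beta, \sigma^2 | x, y\right) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y_i - \alpha - \beta x_i)^2}{2\sigma^2}\right]. \tag{11.36}
\]

Taking the log of this likelihood function yields

\[
\ln(L) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\left(\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)^2. \tag{11.37}
\]

For any value of $\sigma^2$, maximizing this expression with respect to $\alpha$ and $\beta$ is the same as minimizing the sum of squared errors. Thus, under normality the ordinary least squares estimator is also the maximum likelihood estimator.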


11.1.4 Variance of the Ordinary Least Squares Estimator 


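The variance of $\hat{\beta}$ can be derived from the results presented in Section 11.1.2. Note from Equation 11.32,

\[
\hat{\beta} = \sum_{i=1}^{n} d_i y_i = \sum_{i=1}^{n} d_i(\alpha + \beta x_i + \epsilon_i)
= \sum_{i=1}^{n} d_i\alpha + \beta\sum_{i=1}^{n} d_i x_i + \sum_{i=1}^{n} d_i\epsilon_i. \tag{11.38}
\]

Under our standard assumptions about the error term, we have

\[
E\left[\sum_{i=1}^{n} d_i\epsilon_i\right] = \sum_{i=1}^{n} d_i E[\epsilon_i] = 0. \tag{11.39}
\]

In addition, by the unbiasedness constraint of the estimator, we have

\[
\sum_{i=1}^{n} d_i\alpha = 0. \tag{11.40}
\]

Leaving the unbiasedness result

\[
E\left(\hat{\beta}\right) = \beta\sum_{i=1}^{n} d_i x_i = \beta. \tag{11.41}
\]

However, remember that the objective function of the minimization problem that we solved to get the results was the variance of the parameter estimate

\[
V\left(\hat{\beta}\right) = \sigma^2\sum_{i=1}^{n} d_i^2. \tag{11.42}
\]

This assumes that the errors are independently distributed. Thus, substituting the final result for $d_i$ into this expression

\[
V\left(\hat{\beta}\right) = \sigma^2\sum_{i=1}^{n}\frac{(x_i - \bar{x})^2}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}}. \tag{11.43}
\]

The numerator of this fraction is the true error variance. Replacing it with its sample estimate yields the Student's t-distribution for statistical tests of the linear model. Specifically, the standardized slope coefficient is distributed t with $n - 2$ degrees of freedom.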
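11.2 Multivariate Regression


Given that the single cause model is restrictive, we next consider a multivariate regression. In general, the multivariate relationship can be written in matrix form as

\[
y = \begin{pmatrix} 1 & x_1 & x_2 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}
= \beta_0 + \beta_1 x_1 + \beta_2 x_2. \tag{11.44}
\]

If we expand the system to three observations, this system becomes

\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}
=
\begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
=
\begin{bmatrix}
\beta_0 + \beta_1 x_{11} + \beta_2 x_{12} \\
\beta_0 + \beta_1 x_{21} + \beta_2 x_{22} \\
\beta_0 + \beta_1 x_{31} + \beta_2 x_{32}
\end{bmatrix}.
\tag{11.45}
\]

Given that the $X$ matrix is of full rank, we can solve for the $\beta$s. In a statistical application, we have more rows than coefficients. Expanding the exactly identified model in Equation 11.45, we get

\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}
=
\begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \\ 1 & x_{41} & x_{42} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
+
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \end{bmatrix}.
\tag{11.46}
\]

In matrix form, this can be expressed as

\[
y = X\beta + \epsilon. \tag{11.47}
\]

The sum of squared errors can then be written as

\[
\begin{aligned}
SSE &= (y - \hat{y})'(y - \hat{y}) = (y - X\beta)'(y - X\beta) \\
&= (y' - \beta'X')(y - X\beta).
\end{aligned}
\tag{11.48}
\]

Using a little matrix calculus,

\[
\begin{aligned}
dSSE &= d\left\{(y - X\beta)'\right\}(y - X\beta) + (y - X\beta)'\,d\left\{(y - X\beta)\right\} \\
&= -(d\beta)'X'(y - X\beta) - (y - X\beta)'X\,d\beta
\end{aligned}
\tag{11.49}
\]

(see Magnus and Neudecker [29] for a full development of matrix calculus). Note that each term on the right-hand side is a scalar. Since the transpose of a scalar is itself, the right-hand side can be rewritten as

\[
\begin{aligned}
dSSE &= -2(y - X\beta)'X\,d\beta \\
\frac{dSSE}{d\beta} &= -2(y - X\beta)'X = 0 \\
y'X - \beta'X'X &= 0 \\
y'X &= \beta'X'X \\
X'y &= X'X\beta \\
(X'X)^{-1}X'y &= \beta.
\end{aligned}
\tag{11.50}
\]

Thus, we have the standard result $\hat{\beta} = (X'X)^{-1}X'y$. Note that as in the two-parameter system we do not make any assumptions about the distribution of the error ($\epsilon$).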
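
The normal equations $X'X\beta = X'y$ can be solved directly without forming the inverse. The sketch below is a minimal illustration under stated assumptions: the five observations are made-up numbers, and `transpose`, `matmul`, and `solve` are hypothetical helpers written for this example. It computes $\hat{\beta}$ and checks that the residuals are orthogonal to every column of $X$.

```python
# Ordinary least squares by solving the normal equations X'X b = X'y.

def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a z = b (b is a column) by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + [float(b[i][0])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [[aug[i][n] / aug[i][i]] for i in range(n)]

# Illustrative data: a constant and two regressors, five observations.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [[2.1], [3.9], [3.2], [6.1], [4.8]]
X = [[1.0, a, b] for a, b in zip(x1, x2)]

Xt = transpose(X)
beta_hat = solve(matmul(Xt, X), matmul(Xt, y))

fitted = matmul(X, beta_hat)
residuals = [[yi[0] - fi[0]] for yi, fi in zip(y, fitted)]
orthogonality = matmul(Xt, residuals)  # X'e, zero at the least squares solution

print(beta_hat)
print(orthogonality)
```

The orthogonality check is the first-order condition in Equation 11.50 rewritten as $X'(y - X\hat{\beta}) = 0$.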


11.2.1 Variance of Estimator 


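The variance of the parameter matrix can be written as

\[
V\left(\hat{\beta}\right) = E\left[\left(\hat{\beta} - \beta\right)\left(\hat{\beta} - \beta\right)'\right]. \tag{11.51}
\]

Working backward,

\[
\begin{aligned}
y = X\beta + \epsilon \Rightarrow \hat{\beta} &= (X'X)^{-1}X'y \\
&= (X'X)^{-1}X'(X\beta + \epsilon) \\
&= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\epsilon \\
&= \beta + (X'X)^{-1}X'\epsilon.
\end{aligned}
\tag{11.52}
\]

Substituting this back into the variance relationship in Equation 11.51 yields

\[
V\left(\hat{\beta}\right) = E\left[(X'X)^{-1}X'\epsilon\epsilon'X(X'X)^{-1}\right]. \tag{11.53}
\]

Note that $E[\epsilon\epsilon'] = \sigma^2 I$ (i.e., assuming homoscedasticity); therefore

\[
\begin{aligned}
V\left(\hat{\beta}\right) &= E\left[(X'X)^{-1}X'\epsilon\epsilon'X(X'X)^{-1}\right] \\
&= (X'X)^{-1}X'\sigma^2 I X(X'X)^{-1} \\
&= \sigma^2(X'X)^{-1}X'X(X'X)^{-1} \\
&= \sigma^2(X'X)^{-1}.
\end{aligned}
\tag{11.54}
\]

Again, notice that the construction of the variance matrix depends on the assumption that the errors are independently and identically distributed, but we do not assume a specific distribution of the $\epsilon$ (i.e., we do not assume normality of the errors).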


11.2.2 Gauss—Markov Theorem 


The fundamental theorem for most econometric applications is the Gauss-Markov theorem, which states that the ordinary least squares estimator is the best linear unbiased estimator. The theory developed in this section represents a generalization of the result presented in Section 11.1.2.

Theorem 11.2 (Gauss-Markov). Let $\beta^* = C'y$ where $C$ is a $T \times K$ constant matrix such that $C'X = I$. Then, $\hat{\beta}$ is better than $\beta^*$ if $\beta^* \neq \hat{\beta}$.

Proof. Starting with

\[
\beta^* = C'(X\beta + u) = \beta + C'u \tag{11.55}
\]

the problem is how to choose $C$. Given the assumption $C'X = I$, the choice of $C$ guarantees that the estimator $\beta^*$ is an unbiased estimator of $\beta$. The variance of $\beta^*$ can then be written as

\[
\begin{aligned}
V(\beta^*) &= E[C'uu'C] \\
&= C'E[uu']C \\
&= \sigma^2 C'C.
\end{aligned}
\tag{11.56}
\]

To complete the proof, we want to add a special form of zero. Specifically, we want to add $\sigma^2(X'X)^{-1} - \sigma^2(X'X)^{-1} = 0$.

\[
V(\beta^*) = \sigma^2(X'X)^{-1} + \sigma^2 C'C - \sigma^2(X'X)^{-1}. \tag{11.57}
\]

Focusing on the last terms, we note that by the orthogonality conditions for the $C$ matrix ($C'X = I$),

\[
\begin{aligned}
\left(C' - (X'X)^{-1}X'\right)\left(C - X(X'X)^{-1}\right)
&= C'C - C'X(X'X)^{-1} - (X'X)^{-1}X'C + (X'X)^{-1}X'X(X'X)^{-1} \\
&= C'C - (X'X)^{-1}.
\end{aligned}
\tag{11.58}
\]

Substituting backwards,

\[
\begin{aligned}
\sigma^2 C'C - \sigma^2(X'X)^{-1} &= \sigma^2\left[C'C - (X'X)^{-1}\right] \\
&= \sigma^2\left[C'C - C'X(X'X)^{-1} - (X'X)^{-1}X'C + (X'X)^{-1}X'X(X'X)^{-1}\right] \\
&= \sigma^2\left(C' - (X'X)^{-1}X'\right)\left(C - X(X'X)^{-1}\right).
\end{aligned}
\tag{11.59}
\]

Thus,

\[
V(\beta^*) = \sigma^2(X'X)^{-1} + \sigma^2\left[\left(C' - (X'X)^{-1}X'\right)\left(C - X(X'X)^{-1}\right)\right]. \tag{11.60}
\]

The second term is a positive semidefinite matrix that vanishes only when $C = X(X'X)^{-1}$. The minimum variance estimator is then $C = X(X'X)^{-1}$, which is the ordinary least squares estimator.


Again, notice that the only assumption that we require is that the residuals are independently and identically distributed. We do not need to assume normality to prove that ordinary least squares is BLUE.


11.3. Linear Restrictions 


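Consider fitting the linear model

\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon \tag{11.61}
\]

to the data presented in Table 11.2. Solving for the least squares estimates,

\[
\hat{\beta} = (X'X)^{-1}(X'y) = \begin{bmatrix} 4.7238 \\ 4.0727 \\ 3.9631 \\ 2.0185 \\ 0.9071 \end{bmatrix}. \tag{11.62}
\]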


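Estimating the variance matrix,

\[
\begin{aligned}
\hat{s}^2 &= \frac{y'y - y'X(X'X)^{-1}X'y}{n - k} \\
\hat{V}\left(\hat{\beta}\right) &= \hat{s}^2(X'X)^{-1} \\
&= \begin{bmatrix}
0.5037 & -0.0111 & -0.0460 & 0.0252 & -0.0285 \\
-0.0111 & 0.0079 & -0.0068 & 0.0044 & -0.0033 \\
-0.0460 & -0.0068 & 0.0164 & -0.0104 & 0.0047 \\
0.0252 & 0.0044 & -0.0104 & 0.0141 & -0.0070 \\
-0.0285 & -0.0033 & 0.0047 & -0.0070 & 0.0104
\end{bmatrix}.
\end{aligned}
\tag{11.63}
\]

Next, consider the hypothesis that $\beta_1 = \beta_2$ (which seems plausible given the results above). As a starting point, consider the least squares estimator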
TABLE 11.2
Regression Data for Restricted Least Squares
Observation          y        x1        x2        x3        x4
 1            79.72173   4.93638   9.76352   4.389735  2.27485
 2            45.11874   6.95106   3.11080  -1.96920   3.59838
 3            51.61298   4.69639   4.17138   3.84384   2.73787
 4            92.53986  10.22038   8.93246   1.73695   5.86207
 5           118.74310  12.05240  12.22066   6.40735   4.92600
 6            80.78596  10.42798   5.58383   1.61742   9.30154
 7            43.79312   2.94557   5.16446   1.21681   4.75092
 8            47.84554   3.54233   5.58659   2.18433   3.65499
 9            63.02817   4.56528   6.52987   4.40254   5.36942
10            88.83397  11.47854   8.82219   0.70927   2.94652
11           104.06740  11.87840   8.53466   5.21573   8.91658
12            57.40342   7.99115   7.42219  -3.62246  -2.19067
13            76.62745   7.14806   7.39096   5.19569   3.00548
14           109.96540  10.34953   9.82083   7.82591   7.09768
15            72.66822   7.74594   4.79418   5.39538   6.29685
16            68.22719   4.10721   8.51792   4.00252   3.88681
17           122.50920  12.77741  11.57631   6.85352   7.63219
18            70.71453   9.69691   6.54209   0.53160   0.79405
19            70.00971   6.46460   6.62652   4.31049   5.03634
20            79.82481   6.31186   8.49487   3.38461   5.93793
21            38.82780   3.04641   2.99413   2.69198   6.26460
22            79.15832   8.85780   7.29142   3.83994   2.86917
23            62.29580   5.82182   6.16096   4.18066   1.73678
24            80.63698   4.97058   9.83663   6.71842   3.47608
25            77.32687   5.90209   8.56241   5.42130   4.70082
26            23.34500   1.57363   2.82311   0.95729   0.69178
27            81.54044   9.25334   6.43342   5.02273   3.84773
28            67.16680  10.77622   5.21271  -0.87349  -1.17348
29            47.92786   6.96800   2.39798  -0.56746   6.08363
30            48.58950   7.06326   3.24990  -0.77682   3.09636

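as a constrained minimization problem.

\[
\begin{aligned}
L(\beta) &= (y - X\beta)'(y - X\beta) + 2\lambda'(R\beta - r) \\
&= (y' - \beta'X')(y - X\beta) + 2\lambda'(R\beta - r) \\
\nabla_{\beta'}L(\beta) &= -X'(y - X\beta) - X'(y - X\beta) + 2R'\lambda = 0
\end{aligned}
\tag{11.64}
\]

where $\nabla_{\beta'}L(\beta)$ is the gradient or a row vector of derivatives.[1] The second term in the gradient vector depends on the vector derivative

\[
\nabla_{\beta'}(X\beta) = \nabla_{\beta'}(\beta'X')' = \left(\nabla_{\beta}(\beta'X')\right)' = (X')' = X. \tag{11.65}
\]

[1] Technically, the gradient vector is defined as
\[
\nabla_{\beta'}L(\beta) = \begin{bmatrix} \dfrac{\partial L(\beta)}{\partial\beta_1} & \dfrac{\partial L(\beta)}{\partial\beta_2} & \cdots & \dfrac{\partial L(\beta)}{\partial\beta_k} \end{bmatrix}.
\]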
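Solving for the first-order condition for $\beta$,

\[
\begin{aligned}
X'(y - X\beta) + X'(y - X\beta) - 2R'\lambda &= 0 \\
2(X'y) - 2(X'X)\beta - 2R'\lambda &= 0 \\
(X'X)\beta &= (X'y) - R'\lambda \\
\beta &= (X'X)^{-1}(X'y) - (X'X)^{-1}R'\lambda.
\end{aligned}
\tag{11.66}
\]

Taking the gradient of the Lagrange formulation with respect to $\lambda$ and setting it to zero recovers the constraint

\[
r - R\beta = 0. \tag{11.67}
\]

Substituting the solution of $\beta$ into the first-order condition with respect to the Lagrange multiplier,

\[
\begin{aligned}
r - R\left[(X'X)^{-1}(X'y) - (X'X)^{-1}R'\lambda\right] &= 0 \\
r - R(X'X)^{-1}(X'y) + R(X'X)^{-1}R'\lambda &= 0 \\
R(X'X)^{-1}R'\lambda &= R(X'X)^{-1}(X'y) - r \\
\lambda &= \left[R(X'X)^{-1}R'\right]^{-1}\left(R(X'X)^{-1}(X'y) - r\right).
\end{aligned}
\tag{11.68}
\]

Note that substituting $\hat{\beta} = (X'X)^{-1}(X'y)$ into this expression yields

\[
\lambda = \left[R(X'X)^{-1}R'\right]^{-1}\left(R\hat{\beta} - r\right). \tag{11.69}
\]

Substituting this result for $\lambda$ back into the first-order conditions with respect to $\beta$ yields

\[
\begin{aligned}
\beta_R &= (X'X)^{-1}(X'y) - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(R(X'X)^{-1}(X'y) - r\right) \\
&= \hat{\beta} - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(R\hat{\beta} - r\right).
\end{aligned}
\tag{11.70}
\]

Thus, the ordinary least squares estimates can be adjusted to impose the constraint $R\beta - r = 0$.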
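
The adjustment in Equation 11.70 is easy to apply when there is a single restriction, because $R(X'X)^{-1}R'$ is then a scalar. The following sketch makes several assumptions for illustration: the five observations are made-up numbers rather than Table 11.2, the restriction is $\beta_1 = \beta_2$ (so $R = [0\ 1\ {-1}]$ and $r = 0$), and the helper functions are hypothetical. It computes the restricted estimator and the F-statistic of Section 11.3.2 for that restriction.

```python
# Restricted least squares for a single linear restriction R beta = r.

def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a z = b (b is a column) by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + [float(b[i][0])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [[aug[i][n] / aug[i][i]] for i in range(n)]

# Illustrative data: a constant and two regressors, five observations.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [[2.1], [3.9], [3.2], [6.1], [4.8]]
X = [[1.0, a, b] for a, b in zip(x1, x2)]

Xt = transpose(X)
XtX = matmul(Xt, X)
beta_hat = solve(XtX, matmul(Xt, y))       # unrestricted estimator

R = [[0.0, 1.0, -1.0]]                     # restriction beta_1 - beta_2 = 0
r = 0.0
w = solve(XtX, transpose(R))               # (X'X)^{-1} R'
gap = matmul(R, beta_hat)[0][0] - r        # R beta_hat - r
scale = matmul(R, w)[0][0]                 # R (X'X)^{-1} R', a scalar here
beta_r = [[b[0] - wi[0] * gap / scale] for b, wi in zip(beta_hat, w)]

def sse(beta):
    fitted = matmul(X, beta)
    return sum((yi[0] - fi[0]) ** 2 for yi, fi in zip(y, fitted))

n, k, q = len(X), len(X[0]), len(R)
f_stat = ((sse(beta_r) - sse(beta_hat)) / q) / (sse(beta_hat) / (n - k))

print(beta_hat)
print(beta_r)
print(f_stat)
```

The increase in the sum of squared errors, `sse(beta_r) - sse(beta_hat)`, equals `gap ** 2 / scale`, which is the quadratic form $(r - R\hat{\beta})'[R(X'X)^{-1}R']^{-1}(r - R\hat{\beta})$ for one restriction.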
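11.3.1 Variance of the Restricted Estimator

Start by deriving $\beta_R - E[\beta_R]$ based on the previous results.

\[
\begin{aligned}
\beta_R &= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\epsilon - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1} \\
&\quad \times \left(R(X'X)^{-1}X'X\beta + R(X'X)^{-1}X'\epsilon - r\right) \\
E[\beta_R] &= (X'X)^{-1}(X'X)\beta - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1} \\
&\quad \times \left(R(X'X)^{-1}X'X\beta - r\right) \\
\beta_R - E[\beta_R] &= (X'X)^{-1}X'\epsilon - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\epsilon.
\end{aligned}
\tag{11.71}
\]

Computing $(\beta_R - E[\beta_R])(\beta_R - E[\beta_R])'$ based on this result,

\[
\begin{aligned}
(\beta_R - E[\beta_R])(\beta_R - E[\beta_R])' &= (X'X)^{-1}X'\epsilon\epsilon'X(X'X)^{-1} \\
&\quad - (X'X)^{-1}X'\epsilon\epsilon'X(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1} \\
&\quad - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\epsilon\epsilon'X(X'X)^{-1} \\
&\quad + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\epsilon\epsilon' \\
&\qquad \times X(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}.
\end{aligned}
\tag{11.72}
\]

Taking the expectation of both sides and noting that $E[\epsilon\epsilon'] = \sigma^2 I$,

\[
\begin{aligned}
V(\beta_R) &= E\left[(\beta_R - E[\beta_R])(\beta_R - E[\beta_R])'\right] \\
&= \sigma^2\left[(X'X)^{-1} - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}\right].
\end{aligned}
\tag{11.73}
\]

Again, the traditional ordinary least squares estimate of the variance can be adjusted following the linear restriction $R\beta - r = 0$ to produce the variance matrix for the restricted least squares estimator.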


11.3.2 Testing Linear Restrictions 


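In this section we derive the F-test of linear restrictions. Start with the derivation of the error under the restriction

\[
\begin{aligned}
\epsilon_R &= y - X\beta_R \\
&= y - X\hat{\beta} - X\left(\beta_R - \hat{\beta}\right).
\end{aligned}
\tag{11.74}
\]

Compute the sum of squared errors under the restriction

\[
\begin{aligned}
\epsilon_R'\epsilon_R &= \left(\epsilon - X\left(\beta_R - \hat{\beta}\right)\right)'\left(\epsilon - X\left(\beta_R - \hat{\beta}\right)\right) \\
&= \epsilon'\epsilon - \epsilon'X\left(\beta_R - \hat{\beta}\right) - \left(\beta_R - \hat{\beta}\right)'X'\epsilon + \left(\beta_R - \hat{\beta}\right)'X'X\left(\beta_R - \hat{\beta}\right).
\end{aligned}
\tag{11.75}
\]

Taking the expectation of both sides (with $E[\epsilon] = 0$),

\[
\begin{aligned}
E(\epsilon_R'\epsilon_R) &= E(\epsilon'\epsilon) + \left(\beta_R - \hat{\beta}\right)'X'X\left(\beta_R - \hat{\beta}\right) \\
E(\epsilon_R'\epsilon_R) - E(\epsilon'\epsilon) &= \left(\beta_R - \hat{\beta}\right)'X'X\left(\beta_R - \hat{\beta}\right).
\end{aligned}
\tag{11.76}
\]

From our foregoing discussion,

\[
\beta_R - \hat{\beta} = (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(r - R\hat{\beta}\right). \tag{11.77}
\]

Substituting this result back into the previous equation yields

\[
\begin{aligned}
E(\epsilon_R'\epsilon_R) - E(\epsilon'\epsilon) &= \left(r - R\hat{\beta}\right)'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'X(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}\left(r - R\hat{\beta}\right) \\
&= \left(r - R\hat{\beta}\right)'\left[R(X'X)^{-1}R'\right]^{-1}\left(r - R\hat{\beta}\right).
\end{aligned}
\tag{11.78}
\]

Therefore, the test for these linear restrictions becomes

\[
\begin{aligned}
F(q, n - k) &= \frac{\left(\epsilon_R'\epsilon_R - \epsilon'\epsilon\right)/q}{\epsilon'\epsilon/(n - k)} \\
&= \frac{\left(r - R\hat{\beta}\right)'\left[R(X'X)^{-1}R'\right]^{-1}\left(r - R\hat{\beta}\right)/q}{\epsilon'\epsilon/(n - k)}
\end{aligned}
\tag{11.79}
\]

where there are $q$ restrictions, $n$ observations, and $k$ independent variables.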


11.4 Exceptions to Ordinary Least Squares 


Several departures from the assumptions required for BLUE are common in econometrics. In this chapter we address two of the more significant departures: heteroscedasticity and endogeneity. Heteroscedasticity refers to the case where the errors are not identically distributed. This may happen due to a variety of factors such as risk in production. For example, we may want to estimate a production function for corn that is affected by weather events that are unequal across time. Alternatively, the risk of production may be partially a function of one of the input levels (i.e., the level of nitrogen interacting with weather may affect the residual by increasing the risk of production). Endogeneity refers to the scenario where one of the regressors is determined in part by the dependent variable. For example, the demand for an input is affected by the price, which is affected by the aggregate level of demand.


11.4.1 Heteroscedasticity 


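Using the derivation of the variance of the ordinary least squares estimator,

\[
\begin{aligned}
\hat{\beta} &= \beta + (X'X)^{-1}(X'\epsilon) \\
\Rightarrow V\left(\hat{\beta}\right) &= (X'X)^{-1}X'E[\epsilon\epsilon']X(X'X)^{-1} \\
V\left(\hat{\beta}\right) &= (X'X)^{-1}X'SX(X'X)^{-1}
\end{aligned}
\tag{11.80}
\]

under the Gauss-Markov assumptions $S = E[\epsilon\epsilon'] = \sigma^2 I_{T\times T}$.

However, if we assume that $S = E[\epsilon\epsilon'] \neq \sigma^2 I_{T\times T}$, the ordinary least squares estimator is still unbiased, but is no longer efficient. In this case, we use the generalized least squares estimator

\[
\tilde{\beta} = (X'AX)^{-1}(X'Ay). \tag{11.81}
\]

The estimator under heteroscedasticity (generalized least squares) implies

\[
\begin{aligned}
\tilde{\beta} &= (X'AX)^{-1}(X'AX\beta + X'A\epsilon) \\
&= (X'AX)^{-1}(X'AX)\beta + (X'AX)^{-1}X'A\epsilon \\
&= \beta + (X'AX)^{-1}X'A\epsilon.
\end{aligned}
\tag{11.82}
\]

The variance of the generalized least squares estimator then becomes

\[
\begin{aligned}
V\left(\tilde{\beta}\right) &= (X'AX)^{-1}X'A\,E[\epsilon\epsilon']\,A'X(X'AX)^{-1} \\
&= (X'AX)^{-1}X'ASA'X(X'AX)^{-1}.
\end{aligned}
\tag{11.83}
\]

Setting $A = S^{-1}$,

\[
\begin{aligned}
V\left(\tilde{\beta}\right) &= (X'AX)^{-1}X'AX(X'AX)^{-1} \\
&= (X'AX)^{-1}.
\end{aligned}
\tag{11.84}
\]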
The real problem is that the true $A$ matrix is unknown and must be estimated. For example, consider Jorgenson's KLEM (Capital, Labor, Energy, and Materials) data for the agricultural sector presented in Table 11.3. Suppose that we want to estimate the Cobb-Douglas production function using the standard linearization.

\[
\ln(y_t) = \alpha_0 + \alpha_1\ln(x_{1t}) + \alpha_2\ln(x_{2t}) + \alpha_3\ln(x_{3t}) + \alpha_4\ln(x_{4t}) + \epsilon_t. \tag{11.85}
\]

The ordinary least squares estimates of Equation 11.85 are presented in Table 11.4.

TABLE 11.4
Estimated Parameters for the Agricultural Production Function
Parameter     Estimate
$\alpha_0$    -18.7157***
              (2.3003)
$\alpha_1$    1.5999***
              (0.3223)
$\alpha_2$    0.3925**
              (0.1364)
$\alpha_3$    -0.9032***
              (0.1741)
$\alpha_4$    1.4807
              (0.2590)

*** and ** denote statistical significance at the 0.01 and 0.05 levels, respectively.
Numbers in parentheses denote standard errors.

There are several reasons to suspect that the residuals may be correlated with at least one input. As depicted in Figure 11.2, in this case we suspect that the residuals are correlated with the level of energy used. To test for this relationship, we regress the estimated error squared from Equation 11.85 on the logarithm of energy used in agriculture production.

\[
\hat{\epsilon}_t^2 = \beta_0 + \beta_1\ln(x_{3t}) + \nu_t. \tag{11.86}
\]

The estimated parameters in Equation 11.86 are significant at the 0.05 level. Hence, we can use this result to estimate the parameters of the $A$ matrix in Equation 11.81.

\[
\hat{A}_{tt} = \hat{\beta}_0 + \hat{\beta}_1\ln(x_{3t}) \Rightarrow \tilde{\beta} = \left(X'\hat{A}X\right)^{-1}\left(X'\hat{A}y\right) \tag{11.87}
\]

where $\tilde{\beta}$ is the estimated generalized least squares (EGLS) estimator.

One important point to remember is that generalized least squares is always at least as efficient as ordinary least squares (i.e., $A$ could equal the identity matrix). However, estimated generalized least squares is not necessarily as efficient as ordinary least squares, because there is error in the estimation of $A$.
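
When $A$ is diagonal, the generalized least squares estimator in Equation 11.81 reduces to weighted least squares. The sketch below is only an illustration under stated assumptions: the data are made-up numbers, the error variance is assumed (not estimated) to be proportional to $x$, and `weighted_fit` is a hypothetical helper. Each diagonal element of $A = S^{-1}$ is the reciprocal of the assumed variance, and the code checks the weighted normal equations $X'A(y - X\tilde{\beta}) = 0$.

```python
# Generalized least squares with a diagonal A (weighted least squares)
# for a simple regression with a constant and one regressor.

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.7, 5.9, 5.1]

variance_guess = [0.5 * xi for xi in x]  # assumed error variance, rising with x
a = [1.0 / v for v in variance_guess]    # diagonal of A = S^{-1}

def weighted_fit(x, y, a):
    """Return (intercept, slope) minimizing sum a_i (y_i - alpha - beta x_i)^2."""
    total = sum(a)
    x_bar = sum(ai * xi for ai, xi in zip(a, x)) / total
    y_bar = sum(ai * yi for ai, yi in zip(a, y)) / total
    s_xx = sum(ai * (xi - x_bar) ** 2 for ai, xi in zip(a, x))
    s_xy = sum(ai * (xi - x_bar) * (yi - y_bar) for ai, xi, yi in zip(a, x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

alpha_gls, beta_gls = weighted_fit(x, y, a)
alpha_ols, beta_ols = weighted_fit(x, y, [1.0] * len(x))  # A = I gives OLS

# Weighted normal equations: both moments are zero at the GLS solution.
resid = [yi - alpha_gls - beta_gls * xi for xi, yi in zip(x, y)]
moment_const = sum(ai * ei for ai, ei in zip(a, resid))
moment_slope = sum(ai * xi * ei for ai, xi, ei in zip(a, x, resid))

print(alpha_ols, beta_ols)
print(alpha_gls, beta_gls)
```

Setting every weight to one reproduces ordinary least squares, which is the sense in which $A$ could equal the identity matrix.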
FIGURE 11.2 
Estimated Residual Squared. 


Seemingly Unrelated Regressions 


One of the uses of generalized least squares is the estimation of simul- 
taneous systems of equations without endogeneity. Derived input demand 
equations derived from cost minimization implies relationships between the 
parameters 
2, =A, +Ayuy t+ Arwe +Tiyt+ea 
2 = ag + Agiwi + Agw2 + Taiy + €2 


(11.88) 


where x; and 22 are input levels, w, and we are the respective input prices, 
y is the level of output, and Q1, 2, Au, Ajo, Aoi, Ao, Tu, and To1 are 
estimated parameters. 

Both relationships can be estimated simultaneously by forming the regres- 
sion matrices as 


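$$\begin{bmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1n} \\ x_{21} \\ x_{22} \\ \vdots \\ x_{2n} \end{bmatrix} = \begin{bmatrix} 1 & w_{11} & w_{21} & y_1 & 0 & 0 & 0 & 0 \\ 1 & w_{12} & w_{22} & y_2 & 0 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & w_{1n} & w_{2n} & y_n & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & w_{11} & w_{21} & y_1 \\ 0 & 0 & 0 & 0 & 1 & w_{12} & w_{22} & y_2 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & 1 & w_{1n} & w_{2n} & y_n \end{bmatrix} \begin{bmatrix} \alpha_1 \\ A_{11} \\ A_{12} \\ \Gamma_{11} \\ \alpha_2 \\ A_{21} \\ A_{22} \\ \Gamma_{21} \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \vdots \\ \epsilon_{1n} \\ \epsilon_{21} \\ \epsilon_{22} \\ \vdots \\ \epsilon_{2n} \end{bmatrix}. \tag{11.89}$$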
It would be tempting to conclude that this formulation implies that the in-
put demand system requires generalized least squares estimation. Specifically, 
using a two-step methodology, we can estimate the parameter vector using 
ordinary least squares. The ordinary least squares coefficients could then be 
used to estimate the variance for each equation. This variance could then be 
used to estimate the A matrix. 


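$$A = \begin{bmatrix} \frac{1}{s_1^2} & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & \vdots & & \vdots \\ 0 & \cdots & \frac{1}{s_1^2} & 0 & \cdots & 0 \\ 0 & \cdots & 0 & \frac{1}{s_2^2} & \cdots & 0 \\ \vdots & & \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & 0 & \cdots & \frac{1}{s_2^2} \end{bmatrix}. \tag{11.90}$$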

However, the separable nature of the estimation implies that there is no change
in efficiency. To introduce changes in efficiency, we need to impose the
restriction that $A_{12} = A_{21}$. Imposing this restriction on the matrix of independent
variables implies


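$$\begin{bmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1n} \\ x_{21} \\ x_{22} \\ \vdots \\ x_{2n} \end{bmatrix} = \begin{bmatrix} 1 & w_{11} & w_{21} & y_1 & 0 & 0 & 0 \\ 1 & w_{12} & w_{22} & y_2 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & w_{1n} & w_{2n} & y_n & 0 & 0 & 0 \\ 0 & 0 & w_{11} & 0 & 1 & w_{21} & y_1 \\ 0 & 0 & w_{12} & 0 & 1 & w_{22} & y_2 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & w_{1n} & 0 & 1 & w_{2n} & y_n \end{bmatrix} \begin{bmatrix} \alpha_1 \\ A_{11} \\ A_{12} \\ \Gamma_{11} \\ \alpha_2 \\ A_{22} \\ \Gamma_{21} \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \vdots \\ \epsilon_{1n} \\ \epsilon_{21} \\ \epsilon_{22} \\ \vdots \\ \epsilon_{2n} \end{bmatrix}. \tag{11.91}$$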


In the latter case, generalized least squares will yield efficiency gains. 

Estimating Equation 11.91 requires a Kronecker product formulation
(see Equations 10.81 through 10.86), implying Equation 11.89 can be written
as

$$\mathrm{vecr}(y) = \left(X \otimes I_{2 \times 2}\right) \mathrm{vec}(B) + \mathrm{vec}(\epsilon) \tag{11.92}$$


and then restricting $A_{12} = A_{21}$ using restricted least squares as in
Equation 11.70. Given these estimates, the researcher can then construct a sample
estimate of the variance matrix to adjust for heteroscedasticity.


11.4.2 Two Stage Least Squares and Instrumental Variables 


The foregoing example does not involve dependency between equations. For 
example, assume that the supply and demand curves for a given market can 
TABLE 11.5
Ordinary Least Squares

Parameter      Ordinary Least Squares
$\alpha_0$      1.2273
               (1.4657)
$\alpha_1$      3.3280
               (0.1930)
$\alpha_2$     -0.7181
               (0.1566)


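be written as

$$\begin{aligned} q_s &= -3 + 4p_1 - p_2 + \epsilon_1 \\ q_d &= 10 - p_1 + 2y + \epsilon_2. \end{aligned} \tag{11.93}$$

Solving this two equation system for $p_1$ yields

$$p_1 = \frac{13}{5} + \frac{1}{5}p_2 + \frac{2}{5}y - \frac{1}{5}\epsilon_1 + \frac{1}{5}\epsilon_2. \tag{11.94}$$

Ignoring the problem of simultaneity, the supply equation can be estimated
as

$$q_s = \alpha_0 + \alpha_1 p_1 + \alpha_2 p_2 + \epsilon_1. \tag{11.95}$$

The results for this simple estimation are presented in Table 11.5. These
results are not close to the true values ($\alpha_0 = -3$, $\alpha_1 = 4$, $\alpha_2 = -1$). Why? The basic problem is
simultaneous equation bias. Substituting the solution for $p_1$ into the estimated
equation yields

$$q_s = \alpha_0 + \alpha_1\left(\frac{13}{5} + \frac{1}{5}p_2 + \frac{2}{5}y - \frac{1}{5}\epsilon_1 + \frac{1}{5}\epsilon_2\right) + \alpha_2 p_2 + \epsilon_1. \tag{11.96}$$

Substituting

$$\bar{p}_1 = \frac{13}{5} + \frac{1}{5}p_2 + \frac{2}{5}y + \frac{1}{5}\epsilon_2 \;\Rightarrow\; p_1 = \bar{p}_1 - \frac{1}{5}\epsilon_1 \tag{11.97}$$

we note that the $X$ matrix is now correlated with the residual vector.
Specifically, if $\epsilon_1$ and $\epsilon_2$ are uncorrelated and $\epsilon_1$ has variance $\sigma_1^2$,

$$E[p_1\epsilon_1] = -\frac{1}{5}\sigma_1^2 \neq 0. \tag{11.98}$$

Essentially the ordinary least squares results are biased.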
Two Stage Least Squares 


The first approach developed by Theil [55] was to estimate the reduced form 
of the price model and then use this estimated value in the regression. In this 
example, 


$$p_1 = \gamma_0 + \gamma_1 p_2 + \gamma_2 y + \nu_1. \tag{11.99}$$
TABLE 11.6
First-Stage Estimation

Parameter      Ordinary Least Squares
$\gamma_0$      2.65762
               (0.23262)
$\gamma_1$      0.15061
               (0.03597)
$\gamma_2$      0.40602
               (0.01863)

TABLE 11.7
Second-Stage Least Squares Estimator of the Demand Equation

Parameter      Two Stage Least Squares
$\beta_0$       9.9121
               (0.9118)
$\beta_1$      -1.0096
               (0.2761)
$\beta_2$       2.0150
               (0.1113)


The parameter estimates for Equation 11.99 are presented in Table 11.6. Given
the estimated parameters, we generate $\hat{p}_1$ and then estimate

$$q_s = \alpha_0 + \alpha_1 \hat{p}_1 + \alpha_2 p_2 + \epsilon_1. \tag{11.100}$$

In the same way, estimate the demand equation as

$$q_d = \beta_0 + \beta_1 \hat{p}_1 + \beta_2 y + \epsilon_2. \tag{11.101}$$


The results for the second stage estimates of the demand equation are
presented in Table 11.7.
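
The two stages can be traced in a few lines of code. The following Python sketch simulates data from the structural system in Equation 11.93 and then applies the two-stage procedure; the sample size, the distributions of $p_2$, $y$, and the errors, and all variable names are assumptions made for illustration, so the numbers will not match Tables 11.5 through 11.7.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000
p2 = rng.uniform(1.0, 5.0, n)     # hypothetical exogenous price
y = rng.uniform(1.0, 10.0, n)     # hypothetical exogenous income
e1 = rng.normal(0.0, 1.0, n)      # supply shock
e2 = rng.normal(0.0, 1.0, n)      # demand shock

# Equilibrium price and quantity implied by Equation 11.93.
p1 = (13.0 + p2 + 2.0 * y - e1 + e2) / 5.0
q = -3.0 + 4.0 * p1 - p2 + e1

def ols(X, yv):
    """Least squares coefficients for yv regressed on the columns of X."""
    return np.linalg.lstsq(X, yv, rcond=None)[0]

ones = np.ones(n)
# Naive OLS of the supply equation (ignores simultaneity).
b_ols = ols(np.column_stack([ones, p1, p2]), q)
# First stage: reduced form for p1 on all exogenous variables.
Zfs = np.column_stack([ones, p2, y])
gamma = ols(Zfs, p1)
p1_hat = Zfs @ gamma
# Second stage: replace p1 with its fitted value in each structural equation.
b_supply = ols(np.column_stack([ones, p1_hat, p2]), q)
b_demand = ols(np.column_stack([ones, p1_hat, y]), q)

print("first stage :", gamma)
print("OLS supply  :", b_ols)
print("2SLS supply :", b_supply)
print("2SLS demand :", b_demand)
```

In this simulated setting the naive supply slope on $p_1$ is pulled below 4, while the second-stage estimates settle near the structural values.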


Generalized Instrumental Variables 


The alternative is to use instruments to remove the correlation between the
endogenous regressors and the residuals. In this case, we assume that

$$y = X\beta + \epsilon. \tag{11.102}$$

Under the endogeneity assumption,

$$\frac{1}{N}X'\epsilon \not\to 0. \tag{11.103}$$


But we have a set of instruments ($Z$) that are uncorrelated with the residuals
and correlated, though not perfectly, with $X$. The generalized instrumental variable
solution is

$$\beta_{IV} = (X'P_Z X)^{-1}(X'P_Z y) \tag{11.104}$$

where $P_Z = Z(Z'Z)^{-1}Z'$ (see the derivation of the projection matrix in
ordinary least squares in Section 10.2).
In the current case, we use $Z = [1 \;\; p_2 \;\; y]$, yielding

$$\beta_{IV} = \begin{bmatrix} -3.2770 \\ 3.9531 \\ -0.7475 \end{bmatrix}. \tag{11.105}$$


The estimates in Equation 11.105 are very close to the original supply function 
in Equation 11.93. 
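
Equation 11.104 can be evaluated without ever forming the $N \times N$ projection matrix. The sketch below is a minimal Python illustration on simulated data generated from Equation 11.93; the data-generating choices and the helper name `iv_estimator` are assumptions for this example, so the output will be near, but not equal to, Equation 11.105.

```python
import numpy as np

def iv_estimator(X, Z, y):
    """Generalized IV estimate (X'P_Z X)^{-1} X'P_Z y with P_Z = Z (Z'Z)^{-1} Z'."""
    XtZ = X.T @ Z
    ZtZ = Z.T @ Z
    # X'P_Z X = X'Z (Z'Z)^{-1} Z'X and X'P_Z y = X'Z (Z'Z)^{-1} Z'y.
    lhs = XtZ @ np.linalg.solve(ZtZ, Z.T @ X)
    rhs = XtZ @ np.linalg.solve(ZtZ, Z.T @ y)
    return np.linalg.solve(lhs, rhs)

# Simulated (hypothetical) market data consistent with Equation 11.93.
rng = np.random.default_rng(3)
n = 20000
p2 = rng.uniform(1.0, 5.0, n)
inc = rng.uniform(1.0, 10.0, n)
e1 = rng.normal(0.0, 1.0, n)
e2 = rng.normal(0.0, 1.0, n)
p1 = (13.0 + p2 + 2.0 * inc - e1 + e2) / 5.0
q = -3.0 + 4.0 * p1 - p2 + e1

X = np.column_stack([np.ones(n), p1, p2])    # supply-equation regressors
Z = np.column_stack([np.ones(n), p2, inc])   # instruments: Z = [1, p2, y]
b_iv = iv_estimator(X, Z, q)

# With as many instruments as regressors, the estimator reduces to (Z'X)^{-1} Z'y.
b_simple = np.linalg.solve(Z.T @ X, Z.T @ q)
print(b_iv)
```

Because the system is exactly identified here, the projection form and the simple instrumental variable form coincide up to rounding.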


11.4.3 Generalized Method of Moments Estimator


Finally we introduce the generalized method of moments estimator (GMM), 
which combines the generalized least squares estimator with a generalized 
instrumental variable approach. Our general approach follows that of Hall 
[17]. Starting with the basic linear model, 


$$y_t = x_t'\theta_0 + u_t \tag{11.106}$$


where $y_t$ is the dependent variable, $x_t$ is the vector of independent variables, $\theta_0$
is the parameter vector, and $u_t$ is the residual. In addition to these variables,
we will introduce the notion of a vector of instrumental variables denoted $z_t$.
Reworking the original formulation slightly, we can express the residual as a
function of the parameter vector:

$$u_t(\theta_0) = y_t - x_t'\theta_0. \tag{11.107}$$


Based on this expression, estimation follows from the population moment
condition

$$E[z_t u_t(\theta_0)] = 0. \tag{11.108}$$


Or more specifically, we select the vector of parameters so that the residuals 
are orthogonal to the set of instruments. 

Note the similarity between these conditions and the orthogonality condi- 
tions implied by the linear projection space. 


$$P_X = X(X'X)^{-1}X'. \tag{11.109}$$


Further developing the orthogonality condition, note that if a single $\theta_0$
solves the orthogonality conditions, that is, if $\theta_0$ is unique, then

$$E[z_t u_t(\theta)] = 0 \text{ if and only if } \theta = \theta_0. \tag{11.110}$$

Alternatively,

$$E[z_t u_t(\theta)] \neq 0 \text{ if } \theta \neq \theta_0. \tag{11.111}$$
Going back to the original formulation,

$$E[z_t u_t(\theta)] = E[z_t(y_t - x_t'\theta)]. \tag{11.112}$$

Taking the first-order Taylor series expansion around $\theta_0$,

$$E[z_t(y_t - x_t'\theta)] = E[z_t(y_t - x_t'\theta_0)] - E[z_t x_t'](\theta - \theta_0), \quad \text{since } \frac{\partial}{\partial\theta'}(y_t - x_t'\theta) = -x_t'. \tag{11.113}$$

Given that $E[z_t(y_t - x_t'\theta_0)] = E[z_t u_t(\theta_0)] = 0$, this expression implies

$$E[z_t(y_t - x_t'\theta)] = E[z_t x_t'](\theta_0 - \theta). \tag{11.114}$$


Given this background, the most general form of the minimand (objective
function) of the GMM model ($Q_T(\theta)$) can be expressed as

$$Q_T(\theta) = \left\{\frac{1}{T}u(\theta)'Z\right\} W_T \left\{\frac{1}{T}Z'u(\theta)\right\} \tag{11.115}$$


where $T$ is the number of observations, $u(\theta)$ is a column vector of residuals,
$Z$ is a matrix of instrumental variables, and $W_T$ is a weighting matrix (akin
to a variance matrix).

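Given that $W_T$ is a type of variance matrix, it is positive definite,
guaranteeing that

$$z'W_T z > 0 \tag{11.116}$$

for any nonzero vector $z$. Building on the initial model, the expectation is replaced by its sample analog,

$$E[z_t u_t(\theta)] = \frac{1}{T}Z'u(\theta). \tag{11.117}$$

In the linear case,

$$E[z_t u_t(\theta)] = \frac{1}{T}Z'(y - X\theta). \tag{11.118}$$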


Given that $W_T$ is positive definite, the optimality condition when the
residuals are orthogonal to the instruments at the estimated parameters is

$$E[z_t u_t(\theta_0)] = 0 \;\Rightarrow\; \frac{1}{T}Z'u(\hat{\theta}) = 0 \;\Rightarrow\; Q_T(\hat{\theta}) = 0. \tag{11.119}$$


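Working the minimization problem out for the linear case,

$$\begin{aligned} Q_T(\theta) &= \frac{1}{T^2}\left[(y - X\theta)'Z\right] W_T \left[Z'(y - X\theta)\right] \\ &= \frac{1}{T^2}\left[y'ZW_T - \theta'X'ZW_T\right]\left[Z'y - Z'X\theta\right] \\ &= \frac{1}{T^2}\left[y'ZW_TZ'y - \theta'X'ZW_TZ'y - y'ZW_TZ'X\theta + \theta'X'ZW_TZ'X\theta\right]. \end{aligned} \tag{11.120}$$

Note that since $Q_T(\theta)$ is a scalar and $W_T$ is symmetric, $\theta'X'ZW_TZ'y = y'ZW_TZ'X\theta$. Therefore,

$$Q_T(\theta) = \frac{1}{T^2}\left[y'ZW_TZ'y + \theta'X'ZW_TZ'X\theta - 2\theta'X'ZW_TZ'y\right]. \tag{11.121}$$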
Solving the first-order conditions,

$$\begin{aligned} \nabla_\theta Q_T(\theta) &= \frac{1}{T^2}\left[2X'ZW_TZ'X\theta - 2X'ZW_TZ'y\right] = 0 \\ \Rightarrow\; \hat{\theta} &= (X'ZW_TZ'X)^{-1}(X'ZW_TZ'y). \end{aligned} \tag{11.122}$$

An alternative approach is to solve the implicit first-order conditions above. 


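Starting with

$$\begin{aligned} \nabla_\theta Q_T(\theta) &= \frac{1}{T^2}\left[2X'ZW_TZ'X\theta - 2X'ZW_TZ'y\right] = 0 \\ \Rightarrow\; \frac{1}{2}\nabla_\theta Q_T(\theta) &= \left(\frac{1}{T}X'Z\right)W_T\left(\frac{1}{T}Z'X\theta\right) - \left(\frac{1}{T}X'Z\right)W_T\left(\frac{1}{T}Z'y\right) = 0 \\ &= \left(\frac{1}{T}X'Z\right)W_T\left(\frac{1}{T}\left[Z'X\theta - Z'y\right]\right) = 0 \\ &= -\left(\frac{1}{T}X'Z\right)W_T\left(\frac{1}{T}Z'\{y - X\theta\}\right) = 0 \\ \Rightarrow\; & \left(\frac{1}{T}X'Z\right)W_T\left(\frac{1}{T}Z'u(\theta)\right) = 0. \end{aligned} \tag{11.123}$$


Substituting $u(\theta) = y - X\theta$ into Equation 11.123 yields the same relationship
as presented in Equation 11.122.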
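
The closed form in Equation 11.122 is straightforward to compute. The following Python sketch applies it to simulated data with one endogenous regressor and two instruments; the data-generating process, the function name `gmm_linear`, and the second-step weight $(\hat{\sigma}^2 Z'Z/T)^{-1}$ are assumptions chosen for illustration rather than the procedure behind any table in this chapter.

```python
import numpy as np

def gmm_linear(X, Z, y, W):
    """Closed-form linear GMM estimate (X'Z W Z'X)^{-1} X'Z W Z'y."""
    XZW = X.T @ Z @ W
    return np.linalg.solve(XZW @ Z.T @ X, XZW @ Z.T @ y)

# Simulated (hypothetical) data: x is endogenous because it shares v with u.
rng = np.random.default_rng(7)
T = 5000
z1, z2, v, e = rng.normal(size=(4, T))
u = 0.8 * v + 0.6 * e
x = 1.0 + z1 + 0.5 * z2 + v
y = 2.0 + 3.0 * x + u
X = np.column_stack([np.ones(T), x])
Z = np.column_stack([np.ones(T), z1, z2])

# First step: identity weighting matrix.
theta1 = gmm_linear(X, Z, y, np.eye(Z.shape[1]))
# Second step: weight by the inverse of an estimated moment variance (assumed form).
sigma2 = np.mean((y - X @ theta1) ** 2)
W2 = np.linalg.inv(sigma2 * (Z.T @ Z) / T)
theta2 = gmm_linear(X, Z, y, W2)

theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS        :", theta_ols)
print("GMM step 1 :", theta1)
print("GMM step 2 :", theta2)
```

With this choice of second-step weight the estimator coincides with two-stage least squares, which ties the GMM formulation back to Section 11.4.2.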


The Limiting Distribution 


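By the Central Limit Theorem,

$$T^{-1/2}Z'u(\theta_0) = T^{-1/2}\sum_{t=1}^{T} z_t u_t(\theta_0) \xrightarrow{d} N(0, S). \tag{11.124}$$

Therefore

$$\begin{aligned} T^{1/2}(\hat{\theta}_T - \theta_0) &\xrightarrow{d} N(0, MSM') \\ M &= \left(E[x_t z_t']\,W\,E[z_t x_t']\right)^{-1}E[x_t z_t']\,W \\ S &= \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\sum_{s=1}^{T}E[u_t u_s z_t z_s'] = E[u_t^2 z_t z_t'] \\ MSM' &= \left\{E[x_t z_t']\,S^{-1}\,E[z_t x_t']\right\}^{-1}. \end{aligned} \tag{11.125}$$

The second expression for $S$ holds when the moments are serially uncorrelated, and the expression for $MSM'$ is the case $W = S^{-1}$.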
Under the classical instrumental variable assumptions,
mn (11.126) 


Example 11.3 (Differential Demand Model). Following Theil's model for the
derived demands for inputs,

$$f_i\, d\ln[q_i] = \theta_i\, d\ln[O] + \sum_{j=1}^{n}\pi_{ij}\, d\ln[p_j] + \epsilon_i \tag{11.127}$$

where $f_i$ is the factor share of input $i$ ($f_i = p_i q_i / C$ such that $p_i$ is the price
of the input, $q_i$ is the level of the input used, and $C$ is the total cost of
production), $O$ is the level of output, and $\epsilon_i$ is the residual. The model is
typically estimated as

$$\bar{f}_{it}\, D[q_{it}] = \theta_i\, D[O_t] + \sum_{j=1}^{n}\pi_{ij}\, D[p_{jt}] + \epsilon_t \tag{11.128}$$

such that $\bar{f}_{it} = \frac{1}{2}(f_{it} + f_{i,t-1})$ and $D[x_t] = \ln[x_t] - \ln[x_{t-1}]$.

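Applying this to capital in agriculture from Jorgenson's [22] database, the
output is an index of all outputs and the inputs are capital ($p_{ct}$), labor ($p_{lt}$),
energy ($p_{et}$), and materials ($p_{mt}$). Thus,

$$\begin{aligned} X &= \left[\, O_t \;\; p_{ct} \;\; p_{lt} \;\; p_{et} \;\; p_{mt} \,\right]_{t=1}^{T} \\ Z &= \left[\, O_t \;\; p_{ct} \;\; p_{lt} \;\; p_{et} \;\; p_{mt} \;\; O_t^2 \;\; p_{ct}^2 \;\; p_{lt}^2 \;\; p_{et}^2 \;\; p_{mt}^2 \,\right]_{t=1}^{T}. \end{aligned} \tag{11.129}$$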


Rewriting the demand model, 
$$y = X\theta. \tag{11.130}$$

The objective function for the generalized method of moments estimator is

$$Q_T(\theta) = (y - X\theta)'ZW_TZ'(y - X\theta). \tag{11.131}$$


Initially we let $W_T = I$ and minimize $Q_T(\theta)$. This yields a first approximation
to the estimates in the second column of Table 11.8. Updating $W_T$,


$$\hat{S}_{CIV} = \frac{\hat{\sigma}_u^2}{T}Z'Z \tag{11.132}$$


and re-solving yields the second stage generalized method of moments estimates
in the third column of Table 11.8.
TABLE 11.8
Generalized Method of Moments Estimates of Differential Demand Equation

Parameter    First Stage GMM   Second Stage GMM   Ordinary Least Squares
Output        0.01588           0.01592            0.01591
             (0.00865)         (0.00825)          (0.00885)
Capital      -0.00661          -0.00675           -0.00675
             (0.00280)         (0.00261)          (0.00280)
Labor         0.00068           0.00058            0.00058
             (0.03429)         (0.00334)          (0.00359)
Energy        0.00578           0.00572            0.00572
             (0.00434)         (0.00402)          (0.00432)
Materials     0.02734           0.02813            0.02813
             (0.01215)         (0.01068)          (0.01146)


11.5 Chapter Summary 


• Historically ordinary least squares has been the standard empirical
  method in econometrics.

• The classical assumptions for ordinary least squares are:

  - A general linear model $y = X\beta + \epsilon$.
    * Sometimes this model is generated by a first-order Taylor series
      expansion of an unknown function

      $$y = f(x) = f(x_0) + \sum_{i}\left.\frac{\partial f(x)}{\partial x_i}\right|_{x = x_0}(x_i - x_{0i}) + \epsilon \tag{11.133}$$

      where the residual includes the approximation error. Note that this
      construction may lead to problems with heteroscedasticity.
    * Alternatively, it may be possible to transform the model in such a
      way as to yield a linear model. For example, taking the logarithm
      of the Cobb-Douglas production function ($y = \alpha_0 x_1^{\alpha_1} x_2^{\alpha_2}$) yields a
      linear model. Of course this also has implications for the residuals,
      as discussed in Chapter 12.
  - The independent variables have to be fixed (i.e., nonstochastic).
  - The residuals must be homoscedastic (i.e., $E[\epsilon\epsilon'] = \sigma^2 I_{N \times N}$).
  - Given that the model obeys the assumptions, the estimates are best
    linear unbiased regardless of the distribution of the errors.
• If the residuals of a general linear model are normally distributed, the
  ordinary least squares estimates are also maximum likelihood.

• One frequently encountered exception to the conditions for best linear
  unbiased estimator involves differences in the variance ($E[\epsilon\epsilon'] \neq \sigma^2 I_{N \times N}$).

  - This condition is referred to as heteroscedasticity. The problem is
    typically corrected by the design of a weighting matrix $A$ such that
    $A\,E[\epsilon\epsilon'] = \sigma^2 I_{N \times N}$.
  - This correction for heteroscedasticity opens the door to the estimation
    of simultaneous equations. Specifically, we can estimate two different
    equations at one time by realizing that the variance of each equation is
    different (i.e., seemingly unrelated regression).
  - It is important that generalized least squares is always at least as good
    as ordinary least squares when $A$ is known (in fact, if the regression is
    homoscedastic, then $A = I_{T \times T}$). However, estimated generalized least
    squares need not be as efficient as ordinary least squares because the
    estimate of $A$ may contain error.

• One of the possible failures for the assumption that $X$ is fixed involves
  possible correlation between the independent variables and the residual
  term (i.e., $E[X'\epsilon] \neq 0$).

  - These difficulties are usually referred to as endogeneity problems.
  - The two linear corrections for endogeneity are two-stage least squares
    and instrumental variables.


11.6 Review Questions


11-1R. True or false: we have to assume that the residuals are normally
       distributed for ordinary least squares to be best linear unbiased?
       Discuss.

11-2R. Why are the estimated ordinary least squares coefficients normally
       distributed in a small sample if we assume that the residuals are
       normally distributed?

11-3R. When are ordinary least squares coefficients normally distributed
       for large samples?

11-4R. Demonstrate Aitken's theorem [49, p. 238] that $\beta = (X'AX)^{-1}X'Ay$
       yields a minimum variance estimator of $\beta$.


270 Mathematical Statistics for Applied Econometrics 


11.7 Numerical Exercises 
10-1E. Regress

$$r_t = \alpha_0 + \alpha_1 R_t + \alpha_2 \Delta(D/A)_t + \epsilon_t \tag{11.134}$$

using the data in Appendix A for Georgia. Under what conditions
are the estimates best linear unbiased? How would you test for
heteroscedasticity?


12 
The techniques developed in Chapter 11 produce estimates using linear (or iteratively
linear in the case of two-stage least squares and the generalized method
of moments) procedures. Linearity was particularly valuable before the
widespread availability of computers and the development of more complex
mathematical algorithms. However, innovations in computer technology,
coupled with the development of statistical and econometric software, have
liberated our estimation efforts from these historical techniques. In this chapter
we briefly develop three techniques that have no closed-form (or simple linear)
solution: nonlinear least squares and maximum likelihood, applied Bayesian
estimators, and least absolute deviation estimators.


12.1 Nonlinear Least Squares and Maximum Likelihood 


Nonlinear least squares and maximum likelihood are related estimation tech- 
niques dependent on numerical optimization algorithms. To develop these rou- 
tines, consider the Cobb-Douglas production function that is widely used in 
both theoretical and empirical economic literature:

$$y = \alpha_0 x_1^{\alpha_1} x_2^{\alpha_2} \tag{12.1}$$

where $y$ is the level of output, and $x_1$ and $x_2$ are input levels. While this
model is nonlinear, many applications transform the variables by taking the
logarithm of both sides to yield

$$\ln(y) = \tilde{\alpha}_0 + \alpha_1\ln(x_1) + \alpha_2\ln(x_2) + \epsilon \tag{12.2}$$

where $\tilde{\alpha}_0 = \ln(\alpha_0)$. While the transformation allows us to estimate the
production function using ordinary least squares, it introduces a significant
assumption about the residuals. Specifically, if we assume that $\epsilon \sim N(0, \sigma^2)$ so
that we can assume unbiasedness and use $t$-distributions and $F$-distributions
to test hypotheses, the error in the original model becomes log-normal.
Specifically,

$$\ln(y) = \tilde{\alpha}_0 + \alpha_1\ln(x_1) + \alpha_2\ln(x_2) + \epsilon \;\Rightarrow\; y = \alpha_0 x_1^{\alpha_1} x_2^{\alpha_2} e^{\epsilon} \tag{12.3}$$

so the multiplicative error $e^{\epsilon}$ has a mean of $\exp(\sigma^2/2)$ and a variance of
$[\exp(\sigma^2) - 1]\exp(\sigma^2)$. In addition, the distribution is
positively skewed, which is inconsistent with most assumptions about the
error from the production function (i.e., the typical assumption is that most
errors are to the left [negatively skewed] due to firm-level inefficiencies).

The alternative is to specify the error as an additive term: 


$$y = \alpha_0 x_1^{\alpha_1} x_2^{\alpha_2} + \epsilon. \tag{12.4}$$


The specification in Equation 12.4 cannot be estimated using a simple
linear model. However, the model can be estimated using either nonlinear least
squares,

$$\min_{\alpha_0,\alpha_1,\alpha_2} L(\alpha_0,\alpha_1,\alpha_2) = \sum_{i=1}^{N}\left(y_i - \alpha_0 x_{1i}^{\alpha_1} x_{2i}^{\alpha_2}\right)^2 \tag{12.5}$$

which must be solved using iterative nonlinear techniques, or maximum
likelihood.

Consider the corn production data presented in Table 12.1 (taken from the 
first 40 observations from Moss and Schmitz [34]). As a first step, we simplify 
the general form of the Cobb-Douglas in Equation 12.5 to 


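$$\min_{\alpha_1} L(\alpha_1) = \sum_{i=1}^{40}\left(y_i - 50x_{1i}^{\alpha_1}\right)^2 \tag{12.6}$$


(i.e., we focus on the effect of nitrogen on production). Following the standard
formulation, we take the first derivative of Equation 12.6, yielding

$$\frac{\partial L(\alpha_1)}{\partial\alpha_1} = 2\sum_{i=1}^{40}\left(-\ln(x_{1i}) \times 50x_{1i}^{\alpha_1}\right)\left(y_i - 50x_{1i}^{\alpha_1}\right) \tag{12.7}$$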
FIGURE 12.1
Minimum of the Nonlinear Least Squares Formulation. 


and then solve this expression for the $\alpha_1$ that yields $\partial L(\alpha_1)/\partial\alpha_1 = 0$. While
this expression may be tractable, we typically solve for $\alpha_1$ using a
numerical method known as Newton–Raphson.

To motivate this numerical procedure, notice that we are actually trying
to find the zero of a function

$$g(\alpha_1) = -100\sum_{i=1}^{40}\left(\ln(x_{1i})\,x_{1i}^{\alpha_1}\right)\left(y_i - 50x_{1i}^{\alpha_1}\right). \tag{12.8}$$


Figure 12.1 presents the squared error and the derivative (gradient) of the
least squares function for $\alpha_1 \in (0.01, 0.23)$. From the graphical depiction, it is
clear that the minimum error squared occurs at around 0.185. The question
is how to find the exact point in a systematic way.

Newton's method finds the zero of a function (in this case the zero of
the gradient $g(\alpha_1)$ in Equation 12.8) using information in the derivative. For
example, assume that we start at a value of $\alpha_1 = 0.15$, which yields a value
of $g(\alpha_1)$ of -1,051,808. Graphically, draw a triangle based on the tangency
of $g(\alpha_1)$ at that point and solve for the value of $\tilde{\alpha}_1$ such that $g(\tilde{\alpha}_1) = 0$.
To develop this concept a little further, consider the first-order Taylor series
expansion of $g(\alpha_1)$:

$$g(\alpha_1) \approx g(\alpha_1^0) + \left.\frac{\partial g(\alpha_1)}{\partial\alpha_1}\right|_{\alpha_1 = \alpha_1^0}(\alpha_1 - \alpha_1^0). \tag{12.9}$$

Solving for $\tilde{\alpha}_1$ such that $g(\tilde{\alpha}_1) = 0$ implies

$$\tilde{\alpha}_1 = \alpha_1^0 - \frac{g(\alpha_1^0)}{\left.\dfrac{\partial g(\alpha_1)}{\partial\alpha_1}\right|_{\alpha_1 = \alpha_1^0}}. \tag{12.10}$$
ay ay 
TABLE 12.2
Newton–Raphson Iterations for Simple Cobb-Douglas Form

$\alpha_1^0$   $g(\alpha_1)$   $L(\alpha_1)$   $\partial g(\alpha_1)/\partial\alpha_1$   $\Delta\alpha_1$   $\tilde{\alpha}_1$

0.15000 —1,051,807.80 46,762.50 18,698,025.70 —0.056252 0.20625 
0.20625 695,878.17 29,446.05 46,600,076.03 0.014933 0.19132 
0.19132 71,859.09 23,887.35  37,253,566.76 0.001929 0.18939 
0.18939 1,055.24 23,817.36 36,163,659.92 0.000029 0.18936 
0.18936 0.24 23,817.35 36,147,366.06 0.000000 0.18936 


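Given that the derivative of $g(\alpha_1)$ is

$$\frac{\partial g(\alpha_1)}{\partial\alpha_1} = -100\sum_{i=1}^{40}\left[\left(\ln(x_{1i})\right)^2 x_{1i}^{\alpha_1}\left(y_i - 50x_{1i}^{\alpha_1}\right) - 50\left(\ln(x_{1i})\,x_{1i}^{\alpha_1}\right)^2\right] \tag{12.11}$$

the value of the derivative of $g(\alpha_1)$ at 0.15 is 18,698,026. Thus, the next
estimate of the $\alpha_1$ that minimizes the nonlinear least squares is

$$\tilde{\alpha}_1 = 0.15 - \frac{-1{,}051{,}808}{18{,}698{,}026} \approx 0.20625. \tag{12.12}$$

Evaluating $L(\alpha_1)$ at this point yields a smaller value (29,446.05 compared
with 46,762.50). Table 12.2 presents the solution of the minimization problem
following the Newton–Raphson algorithm. To clean up the derivation a little, note
that $g(\alpha_1) = \partial L(\alpha_1)/\partial\alpha_1$; the Newton–Raphson algorithm to minimize the
nonlinear least squares is actually

$$\tilde{\alpha}_1 = \alpha_1^0 - \left.\frac{\partial L(\alpha_1)/\partial\alpha_1}{\partial^2 L(\alpha_1)/\partial\alpha_1^2}\right|_{\alpha_1 = \alpha_1^0}. \tag{12.13}$$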
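
The iteration in Equations 12.10 through 12.13 is easy to code. The Python sketch below uses simulated data because the corn data of Table 12.1 are not reproduced here; the input range, the noise level, and the true exponent of 0.19 are assumptions made for the example, so the iterates will resemble, but not match, Table 12.2.

```python
import numpy as np

# Simulated (hypothetical) stand-in for the 40 corn observations.
rng = np.random.default_rng(1)
x = rng.uniform(50.0, 200.0, 40)
y = 50.0 * x ** 0.19 + rng.normal(0.0, 5.0, 40)

def g(a):
    """Gradient of L(a) = sum (y - 50 x^a)^2, as in Equation 12.8."""
    r = y - 50.0 * x ** a
    return -100.0 * np.sum(np.log(x) * x ** a * r)

def dg(a):
    """Derivative of g(a), as in Equation 12.11."""
    r = y - 50.0 * x ** a
    lx2 = np.log(x) ** 2
    return -100.0 * np.sum(lx2 * x ** a * r - 50.0 * lx2 * x ** (2 * a))

def newton_raphson(a0, tol=1e-10, max_iter=50):
    """Iterate a <- a - g(a)/g'(a) until the step is smaller than tol."""
    a = a0
    for _ in range(max_iter):
        step = g(a) / dg(a)
        a -= step
        if abs(step) < tol:
            break
    return a

alpha_hat = newton_raphson(0.15)
print(alpha_hat, g(alpha_hat))
```

Starting from 0.15, the first step typically overshoots the minimum and later steps shrink rapidly, which is the same pattern seen in the iterations of Table 12.2.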

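To expand the estimation process to more than one parameter, we return to
the problem in Equation 12.5. In addition, we need to introduce the concept
of a gradient vector, which is essentially a vector of scalar derivatives. The
gradient of Equation 12.5 with respect to the three parameters ($\{\alpha_0, \alpha_1, \alpha_2\}$)
is a $3 \times 1$ vector

$$\nabla_\alpha L(\alpha) = \begin{bmatrix} \dfrac{\partial L(\alpha)}{\partial\alpha_0} \\ \dfrac{\partial L(\alpha)}{\partial\alpha_1} \\ \dfrac{\partial L(\alpha)}{\partial\alpha_2} \end{bmatrix} = \begin{bmatrix} -2\sum_{i=1}^{N} x_{1i}^{\alpha_1}x_{2i}^{\alpha_2}\left(y_i - \alpha_0 x_{1i}^{\alpha_1}x_{2i}^{\alpha_2}\right) \\ -2\alpha_0\sum_{i=1}^{N} x_{1i}^{\alpha_1}x_{2i}^{\alpha_2}\ln(x_{1i})\left(y_i - \alpha_0 x_{1i}^{\alpha_1}x_{2i}^{\alpha_2}\right) \\ -2\alpha_0\sum_{i=1}^{N} x_{1i}^{\alpha_1}x_{2i}^{\alpha_2}\ln(x_{2i})\left(y_i - \alpha_0 x_{1i}^{\alpha_1}x_{2i}^{\alpha_2}\right) \end{bmatrix}. \tag{12.14}$$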
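The Hessian matrix, which is the $3 \times 3$ matrix equivalent to the second
derivative, is defined as

$$\nabla^2_{\alpha\alpha}L(\alpha) = \begin{bmatrix} \dfrac{\partial^2 L(\alpha)}{\partial\alpha_0^2} & \dfrac{\partial^2 L(\alpha)}{\partial\alpha_0\partial\alpha_1} & \dfrac{\partial^2 L(\alpha)}{\partial\alpha_0\partial\alpha_2} \\ \dfrac{\partial^2 L(\alpha)}{\partial\alpha_1\partial\alpha_0} & \dfrac{\partial^2 L(\alpha)}{\partial\alpha_1^2} & \dfrac{\partial^2 L(\alpha)}{\partial\alpha_1\partial\alpha_2} \\ \dfrac{\partial^2 L(\alpha)}{\partial\alpha_2\partial\alpha_0} & \dfrac{\partial^2 L(\alpha)}{\partial\alpha_2\partial\alpha_1} & \dfrac{\partial^2 L(\alpha)}{\partial\alpha_2^2} \end{bmatrix}. \tag{12.15}$$

The matrix form of Equation 12.13 can then be expressed as

$$\tilde{\alpha} = \alpha^0 - \left.\left[\nabla^2_{\alpha\alpha}L(\alpha)\right]^{-1}\nabla_\alpha L(\alpha)\right|_{\alpha = \alpha^0}. \tag{12.16}$$
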

The numerical solution to the three parameter Cobb-Douglas is presented in 
Appendix A. 

To develop the distribution of the nonlinear least squares estimator, con- 
sider a slight reformulation of Equation 12.5: 


$$L(x, y|\alpha) = [y - f(x,\alpha)]'[y - f(x,\alpha)]. \tag{12.17}$$


So we are separating the dependent variable $y$ from the predicted component
$f(x,\alpha)$. Given this formulation, we can then define the overall squared error
of the estimate ($s^*(\alpha)$):

$$s^*(\alpha) = \sigma^2 + \int_X \left[f(x,\alpha^*) - f(x,\alpha)\right]^2 d\mu(x) \tag{12.18}$$

where $\alpha^*$ is the level of the parameters that minimizes the overall squared error
and $\alpha$ is a general value of the parameters. Gallant [14] states this result using
measure theory. Without a great loss in generality, we rewrite Equation 12.18
as

$$s^*(\alpha) = \sigma^2 + \int_{-\infty}^{\infty}\left[f(x,\alpha^*) - f(x,\alpha)\right]^2 dG(x) \tag{12.19}$$

where $dG(x) = g(x)\,dx$ and $g(x)$ is the probability density function of $x$. Next, we
substitute a nonoptimal value ($\alpha^0$) for $\alpha^*$, yielding an error term $e$ such that
$f(x,\alpha^0) + e = f(x,\alpha^*)$. Substituting this result into Equation 12.19,

$$s(e,\alpha^0,\alpha) = \sigma^2 + \int_{-\infty}^{\infty}\left[e + f(x,\alpha^0) - f(x,\alpha)\right]^2 dG(x). \tag{12.20}$$

Looking forward, if $\alpha^0 \to \alpha^*$ then $e + f(x,\alpha^0) \to f(x,\alpha^*)$ and $s(e,\alpha^0,\alpha) \to s^*(\alpha)$.
Next, we take the derivative (gradient) of Equation 12.20,

$$\nabla_\alpha s(e,\alpha^0,\alpha) = -2\int_{-\infty}^{\infty}\left[e + f(x,\alpha^0) - f(x,\alpha)\right]\nabla_\alpha f(x,\alpha)\,dG(x). \tag{12.21}$$
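Notice that in this formulation $(e + f(x,\alpha^0) - f(x,\alpha))$ is a scalar number
while $\nabla_\alpha f(x,\alpha)$ is a vector with the same number of rows as the number
of parameters in the nonlinear expression. Taking the second derivative of
Equation 12.20 with respect to the parameter vector (or the gradient of
Equation 12.21) yields

$$\nabla^2_{\alpha\alpha}s(e,\alpha^0,\alpha) = 2\int_{-\infty}^{\infty}\nabla_\alpha f(x,\alpha)\nabla_\alpha f(x,\alpha)'\,dG(x) - 2\int_{-\infty}^{\infty}\left[e + f(x,\alpha^0) - f(x,\alpha)\right]\nabla^2_{\alpha\alpha}f(x,\alpha)\,dG(x). \tag{12.22}$$

A couple of things about Equations 12.21 and 12.22 are worth noting.

First, we can derive the sample equivalents of Equations 12.21 and 12.22 as

$$\begin{aligned} \nabla_\alpha s(e,\alpha^0,\alpha) &\Rightarrow -2\left[e + F(x,\alpha^0) - F(x,\alpha)\right]'\nabla_\alpha F(x,\alpha) \\ \nabla^2_{\alpha\alpha}s(e,\alpha^0,\alpha) &\Rightarrow 2\nabla_\alpha F(x,\alpha)'\nabla_\alpha F(x,\alpha) - 2\sum_{i=1}^{N}\left[e_i + f(x_i,\alpha^0) - f(x_i,\alpha)\right]\nabla^2_{\alpha\alpha}f(x_i,\alpha). \end{aligned} \tag{12.23}$$

Second, following the concept of a limiting distribution,

$$\frac{1}{\sqrt{N}}\nabla_\alpha F(\alpha^0)'e \xrightarrow{d} N_p\left(0, \sigma^2\frac{1}{N}\nabla_\alpha F(\alpha^0)'\nabla_\alpha F(\alpha^0)\right) \tag{12.24}$$

where $p$ is the number of parameters. This is basically the result of the central
limit theorem given that $\frac{1}{N}\left[e + F(x,\alpha^0) - F(x,\alpha)\right]'\mathbf{1} \to 0$ (where $\mathbf{1}$ is a
conformable column vector of ones). By a similar conjecture,

$$\sqrt{N}\left(\hat{\alpha} - \alpha^0\right) \xrightarrow{d} N_p\left(0, \sigma^2\left[\frac{1}{N}\nabla_\alpha F(\alpha^0)'\nabla_\alpha F(\alpha^0)\right]^{-1}\right). \tag{12.25}$$

Finally, the standard error can be estimated as

$$s^2 = \frac{1}{N - p}\left[y - F(x,\hat{\alpha})\right]'\left[y - F(x,\hat{\alpha})\right]. \tag{12.26}$$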


Of course the derivation of the limiting distributions of the nonlinear least 
squares estimator is typically superfluous given the assumption of normal- 
ity. Specifically, assuming that the residual is normally distributed, we could 
rewrite the log-likelihood function for the three parameter Cobb-Douglas 
problem as 


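$$\ln\left(L(x,y|\alpha,\sigma^2)\right) \propto -\frac{N}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - \alpha_0 x_{1i}^{\alpha_1}x_{2i}^{\alpha_2}\right)^2. \tag{12.27}$$


Based on our standard results for maximum likelihood,

$$\hat{\alpha} \sim N\left(\alpha, \left[\frac{1}{2\sigma^2}\sum_{i=1}^{N}\nabla^2_{\alpha\alpha}\left(y_i - \alpha_0 x_{1i}^{\alpha_1}x_{2i}^{\alpha_2}\right)^2\right]^{-1}\right). \tag{12.28}$$

Comparing the results of Equation 12.24 with Equation 12.28, any real
differences in the distribution derive from $\nabla_\alpha F(\alpha)'\nabla_\alpha F(\alpha) \approx \nabla^2_{\alpha\alpha}F(\alpha)$.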


12.2 Bayesian Estimation 


Historically, Bayesian applications were limited to problems with well-formed
conjugate families. In Section 8.2 we derived the Bayesian estimator of a
Bernoulli parameter with a beta prior. This combination led to a closed-form
posterior distribution, one that we could write out. Since the late 1990s
advances in both numerical procedures and computer power have opened the
door to more general Bayesian applications based on simulation.


12.2.1 Basic Model 


An empirical or statistical model used for research purposes typically assumes 
that our observations are functions of unobservable parameters. Mathemati- 
cally, 


$$p(y|\theta) \tag{12.29}$$


where $p(\cdot)$ is the probability of $y$, a set of observable outcomes (e.g., crop
yields), and $\theta$ is a vector of unobservable parameters (or latent variables).
Under normality $p(y|\mu,\sigma^2)$ means that the probability of an observed outcome
($y = 50.0$) is a function of unobserved parameters such as the mean of the
normal ($\mu$) and its variance ($\sigma^2$). The formulation in Equation 12.29 is for any
general outcome (i.e., any potential level of $y$). It is important to distinguish
between a general outcome, $y$, and a specific outcome that we observe, $y^0$.
In general, we typically refer to a specific outcome as "data."

If we do not know what the value of $\theta$ is, we can depict the density function
for $\theta$ as $p(\theta)$. We can then combine the density function for $\theta$ ($p(\theta)$) with
our formulation of the probability of the observables ($p(y|\theta)$) to produce
information about the observables that is not conditioned on knowing the
value of $\theta$:

$$p(y) = \int p(\theta)\,p(y|\theta)\,d\theta. \tag{12.30}$$


Next, we index our relation between observables and unobservables as model
$A$. Hence, $p(y|\theta)$ becomes $p(y|\theta_A, A)$, and $p(\theta)$ becomes $p(\theta_A|A)$. We shall
denote the object of interest on which decision making depends, and about which
all models relevant to the decision have something to say, by the vector $\omega$.
We shall denote the implications of the model $A$ for $\omega$ by $p(\omega|y,\theta_A,A)$.
In summary, we have identified three components of a complete model, $A$,
involving unobservables (often parameters) $\theta_A$, observables $y$, and a vector of
interest $\omega$.
12.2.2 Conditioning and Updating


Given this setup, we can sketch out the way to build on information from
our prior beliefs about the unobservable parameters using sample
information. As a starting point, we denote the prior distribution as $p(\theta_A|A)$. This
captures our initial intuition about the probability of the unobservable
parameters (i.e., the mean and variance of the distribution). These expectations
are conditioned on a model $A$ (e.g., normality). In part, these prior beliefs
specify a distribution for the observable variables, $p(y|\theta_A, A)$. Given these
two pieces of information, we can derive the probability of the unobservable
parameters based on an observed sample ($y^0$):

$$p(\theta_A|y^0, A) = \frac{p(\theta_A, y^0|A)}{p(y^0|A)} = \frac{p(\theta_A|A)\,p(y^0|\theta_A, A)}{p(y^0|A)} \tag{12.31}$$

where $p(\theta_A|y^0, A)$ is referred to as the posterior distribution.

As a simple example, assume that we are interested in estimating a simple
mean for a normal distribution. As a starting point, assume that our prior
distribution for the mean is a normal distribution with a mean of $\tilde{\mu}$ and a
known variance of 1. Our prior distribution then becomes

$$f_1(\mu|\tilde{\mu}, 1) = \frac{1}{\sqrt{2\pi}}\exp\left[-\frac{(\mu - \tilde{\mu})^2}{2}\right]. \tag{12.32}$$

Next, assume that the outcome is normally distributed with a mean of $\mu$ from
Equation 12.32,

$$f_2(y|\mu, k) = \frac{1}{\sqrt{2\pi k}}\exp\left[-\frac{(y - \mu)^2}{2k}\right] \tag{12.33}$$

where $k$ is some fixed variance. The posterior distribution then becomes

$$f(\mu|y^0, k) \propto f_1(\mu|\tilde{\mu}, 1)\,f_2(y^0|\mu, k) \propto \frac{1}{2\pi\sqrt{k}}\exp\left[-\frac{(\mu - \tilde{\mu})^2}{2} - \frac{(y^0 - \mu)^2}{2k}\right] \tag{12.34}$$

(ignoring for the moment the denominator of Equation 12.31). One application
of this formulation is to assume that we are interested in the share of cost
associated with one input (say the constant in the share equation in a Translog
cost function). If there are four inputs, we could assume a priori (our prior
guess) that the average share for any one input would be 0.25. Hence, we set
$\tilde{\mu} = 0.25$ in Equation 12.34. The posterior distribution for any $y^0$ would then
become

$$f(\mu|y^0) \propto \exp\left[-\frac{(\mu - 0.25)^2}{2} - \frac{(y^0 - \mu)^2}{2k}\right]. \tag{12.35}$$
Next, assume that we have a single draw from the sample, $y^0 = 0.1532$. The
empirical posterior distribution would then become

$$f(\mu|y^0 = 0.1532, k) \propto \frac{1}{2\pi\sqrt{k}}\exp\left[-\frac{(\mu - 0.25)^2}{2} - \frac{(0.1532 - \mu)^2}{2k}\right]. \tag{12.36}$$

Returning to the denominator from Equation 12.31, $p(y^0|A)$ is the
probability of drawing the observed sample unconditional on the value of $\theta$, in this
case, unconditioned on the value of $\mu$. Mathematically,

$$f(0.1532) = \int_{-\infty}^{\infty}\frac{1}{2\pi\sqrt{k}}\exp\left[-\frac{(\mu - 0.25)^2}{2} - \frac{(0.1532 - \mu)^2}{2k}\right]d\mu = 0.28144 \tag{12.37}$$

(computed using Mathematica). The complete posterior distribution can then
be written as

$$f(\mu|y^0 = 0.1532, k) = \frac{\dfrac{1}{2\pi\sqrt{k}}\exp\left[-\dfrac{(\mu - 0.25)^2}{2} - \dfrac{(0.1532 - \mu)^2}{2k}\right]}{0.28144}. \tag{12.38}$$

One use of the posterior distribution in Equation 12.38 is to compute a
Bayesian estimate of $\mu$. This is typically accomplished by minimizing the loss
function

$$\begin{aligned} \min_{\mu_B} L(\mu, \mu_B) &= E[\mu - \mu_B]^2 \;\Rightarrow\; \mu_B = E[\mu] \\ \Rightarrow\; E[\mu] &= \int_{-\infty}^{\infty} f(\mu|y^0 = 0.1532, k)\,\mu\,d\mu = 0.2016 \end{aligned} \tag{12.39}$$

again relying on Mathematica for the numeric integral.
Next, consider the scenario where we have two sample points, $y^0 =
\{0.1532, 0.1620\}$. In this case the sample distribution becomes

$$f_2(y^0|\mu, k) = \frac{1}{2\pi k}\exp\left[-\frac{(0.1532 - \mu)^2 + (0.1620 - \mu)^2}{2k}\right]. \tag{12.40}$$

The posterior distribution can then be derived based on

$$f(\mu|y^0, k) \propto \frac{1}{2\pi\sqrt{k}}\exp\left[-\frac{(\mu - 0.25)^2}{2} - \frac{(0.1532 - \mu)^2 + (0.1620 - \mu)^2}{2k}\right]. \tag{12.41}$$

Integrating the denominator numerically yields 0.2300. Hence, the posterior
distribution becomes

$$f(\mu|y^0, k) = \frac{\dfrac{1}{2\pi\sqrt{k}}\exp\left[-\dfrac{(\mu - 0.25)^2}{2} - \dfrac{(0.1532 - \mu)^2 + (0.1620 - \mu)^2}{2k}\right]}{0.2300} \tag{12.42}$$

which yields a Bayesian estimate for $\mu$ of 0.1884. Note that this estimate is
much higher than the sample mean of 0.1576; the prior pulls the estimate
upward.
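
The numerical integrals above can be checked with a simple grid approximation. The following Python sketch is an illustration rather than the Mathematica computation used in the text; the grid limits, the function name `posterior_summary`, and the choice $k = 1$ are assumptions (with $k = 1$ the sketch reproduces the reported posterior means of 0.2016 and 0.1884).

```python
import numpy as np

def posterior_summary(y_obs, prior_mean=0.25, prior_var=1.0, k=1.0):
    """Posterior mean of mu and marginal density of the sample, by grid integration."""
    mu = np.linspace(-10.0, 10.0, 200001)
    dmu = mu[1] - mu[0]
    # Log of prior density times likelihood, evaluated on the grid.
    log_joint = -0.5 * np.log(2 * np.pi * prior_var) - (mu - prior_mean) ** 2 / (2 * prior_var)
    for yi in y_obs:
        log_joint = log_joint - 0.5 * np.log(2 * np.pi * k) - (yi - mu) ** 2 / (2 * k)
    joint = np.exp(log_joint)
    marginal = joint.sum() * dmu                  # denominator p(y0 | A)
    mean = (mu * joint).sum() * dmu / marginal    # Bayesian estimate E[mu | y0]
    return mean, marginal

mean1, marg1 = posterior_summary([0.1532])
mean2, marg2 = posterior_summary([0.1532, 0.1620])
print(round(mean1, 4), round(marg1, 5))   # posterior mean and marginal, one draw
print(round(mean2, 4))                    # posterior mean, two draws
```

Adding the second observation moves the estimate from the prior mean of 0.25 toward the sample mean of 0.1576, which is the updating behavior described in this section.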

In the foregoing development we have been playing a little fast and loose
with the variance of the normal. Let us return to the prior distribution for the
mean:

$$f_1(\mu|\tilde{\mu}, \sigma_\mu^2) = \frac{1}{\sqrt{2\pi\sigma_\mu^2}}\exp\left[-\frac{(\mu - \tilde{\mu})^2}{2\sigma_\mu^2}\right]. \tag{12.43}$$

Here we explicitly recognize the fact that our prior for $\mu$ (based on the normal
distribution) has a variance parameter $\sigma_\mu^2$. For the next wrinkle, we want to
recognize the variance of the observed variable ($y$):

$$f_2(y|\mu, \sigma_y^2) = \frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\left[-\frac{(y - \mu)^2}{2\sigma_y^2}\right] \tag{12.44}$$

where $\sigma_y^2$ is the variance of the normal distribution for our observed variable.

Based on the formulation in Equation 12.44 we have acquired a new
unknown parameter, $\sigma_y^2$. Like the mean, we need to formulate a prior for this
variable. In formulating this prior we need to consider the characteristics
of the parameter: it must always be positive (say $V \equiv \sigma_y^2 > 0$). Following
Section 3.5, the gamma distribution can be written as

$$f(V|\alpha, \beta) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\beta^{\alpha}}V^{\alpha - 1}\exp\left[-\dfrac{V}{\beta}\right] & \text{such that } 0 < V < \infty \\ 0 & \text{otherwise.} \end{cases} \tag{12.45}$$

Letting $\alpha = 2.5$ and $\beta = 2$, the graph of the gamma density is depicted in
Figure 12.2.

Folding Equation 12.43 and Equation 12.45 together gives the complete
prior (i.e., with the variance terms)

$$f_1(\mu, V|\tilde{\mu}, \sigma_\mu^2, \alpha, \beta) = f_{11}(\mu|\tilde{\mu}, \sigma_\mu^2) \times f_{12}(V|\alpha, \beta). \tag{12.46}$$
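FIGURE 12.2
Gamma Distribution Function.

The sample distribution then becomes

$$f_2(y|\mu, V) = \frac{1}{\sqrt{2\pi V}}\exp\left[-\frac{(y - \mu)^2}{2V}\right]. \tag{12.47}$$

The posterior distribution then becomes

$$\begin{aligned} f(\mu, V|y, \tilde{\mu}, \sigma_\mu^2, \alpha, \beta) &\propto \frac{1}{\sqrt{2\pi\sigma_\mu^2}}\,\frac{1}{\sqrt{2\pi V}}\,\frac{V^{\alpha - 1}}{\Gamma(\alpha)\beta^{\alpha}}\exp\left[-\frac{(\mu - \tilde{\mu})^2}{2\sigma_\mu^2} - \frac{(y - \mu)^2}{2V} - \frac{V}{\beta}\right] \\ &= \frac{V^{\alpha - 1}}{2\pi\sigma_\mu\sqrt{V}\,\Gamma(\alpha)\beta^{\alpha}}\exp\left[-\frac{(\mu - \tilde{\mu})^2}{2\sigma_\mu^2} - \frac{(y - \mu)^2}{2V} - \frac{V}{\beta}\right]. \end{aligned} \tag{12.48}$$

For simplification purposes, the gamma distribution is typically replaced with
the inverse gamma.

Focusing on the exponential term in Equation 12.48, we can derive a form
of a mixed estimator. The first two terms in the exponent of Equation 12.48
can be combined:

$$-\frac{(\mu - \tilde{\mu})^2}{2\sigma_\mu^2} - \frac{(y - \mu)^2}{2V} = -\frac{V(\mu - \tilde{\mu})^2 + \sigma_\mu^2(y - \mu)^2}{2\sigma_\mu^2 V}. \tag{12.49}$$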
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With a little bit of flare, we are going to divide the numerator by oF, and let 
¢=V/o7,, yielding 
(u-f)  (y-w)? _ ou f+ (y-w)? (12.50) 
207, 2V 2V : : 


Maximizing Equation 12.50 with respect to w gives an estimator of jz condi- 
tional on fj and ¢: 
~_ outy 
= : 12.51 
ee (12.51) 
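
The step from Equation 12.50 to Equation 12.51 can be spelled out as a short worked derivation. This is only a sketch using the symbols already defined ($\mu$, $\tilde{\mu}$, $\phi$, $y$, $V$); it assumes that $V > 0$ and $\phi$ are held fixed while the exponent is maximized over $\mu$.

```latex
\begin{align*}
\frac{\partial}{\partial\mu}\left[-\frac{\phi(\mu-\tilde{\mu})^2+(y-\mu)^2}{2V}\right]
  &= -\frac{2\phi(\mu-\tilde{\mu}) - 2(y-\mu)}{2V} = 0 \\
\Rightarrow\quad \phi(\mu-\tilde{\mu}) &= y-\mu \\
\Rightarrow\quad (1+\phi)\,\mu &= \phi\tilde{\mu} + y \\
\Rightarrow\quad \hat{\mu} &= \frac{\phi\tilde{\mu}+y}{1+\phi}
\end{align*}
```

The second derivative of the exponent is $-(1+\phi)/V < 0$, so the stationary point is a maximum.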


Next, let us extend the sample distribution to include two observations,

$$ f_2\left(y_1, y_2 \mid \mu, \sigma_y^2\right) = \prod_{i=1}^{2} f_2\left(y_i \mid \mu, \sigma_y^2\right). \tag{12.52} $$

Folding the results for Equation 12.52 into Equation 12.48 implies a
two-observation form of Equation 12.50.

$$ -\frac{(\mu-\tilde{\mu})^2}{2\sigma_\mu^2} - \frac{(y_1-\mu)^2 + (y_2-\mu)^2}{2V} = -\frac{\phi(\mu-\tilde{\mu})^2 + (y_1-\mu)^2 + (y_2-\mu)^2}{2V} \tag{12.53} $$

With a little bit of effort, the value of $\mu$ that maximizes the exponent can be
derived as

$$ \hat{\mu} = \frac{\dfrac{\phi}{2}\tilde{\mu} + \dfrac{y_1+y_2}{2}}{1+\dfrac{\phi}{2}}. \tag{12.54} $$

Defining $\bar{y} = (y_1 + y_2)/2$, this estimator becomes

$$ \hat{\mu} = \frac{\dfrac{\phi}{2}\tilde{\mu} + \bar{y}}{1+\dfrac{\phi}{2}}. \tag{12.55} $$

This formulation clearly combines (or mixes) sample information with prior
information.

Next, consider extending the sample to $n$ observations. The sample distribution
becomes

$$ f_2\left(y_1, y_2, \ldots, y_n \mid \mu, \sigma_y^2\right) = \prod_{i=1}^{n} f_2\left(y_i \mid \mu, \sigma_y^2\right) = \left(2\pi V\right)^{-n/2} \exp\left[-\frac{\sum_{i=1}^{n}(y_i-\mu)^2}{2V}\right]. \tag{12.56} $$

TABLE 12.3
Capital Share in KLEM Data

Year    Share
1960    0.1532
1961    0.1620
1962    0.1548
1963    0.1551
1964    0.1343
1965    0.1534
1966    0.1587
1967    0.1501
1968    0.1459
1969    0.1818


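Combining the exponent term for the posterior distribution yields

$$ -\frac{(\mu-\tilde{\mu})^2}{2\sigma_\mu^2} - \frac{\sum_{i=1}^{n}(y_i-\mu)^2}{2V} = -\frac{\phi(\mu-\tilde{\mu})^2 + \sum_{i=1}^{n}(y_i-\mu)^2}{2V}. \tag{12.57} $$

Following our standard approach, the value of $\mu$ that maximizes the exponent
term is

$$ \hat{\mu} = \frac{\dfrac{\phi}{n}\tilde{\mu} + \dfrac{1}{n}\sum_{i=1}^{n} y_i}{1+\dfrac{\phi}{n}} = \frac{\dfrac{\phi}{n}\tilde{\mu} + \bar{y}}{1+\dfrac{\phi}{n}}. \tag{12.58} $$
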

Consider the data on the share of cost spent on capital inputs for agriculture
from Jorgenson’s KLEM data presented in Table 12.3. Let us assume
a prior for this share of 0.25 (i.e., 1/4 of overall cost). First, let us use only
the first observation and assume $\phi = 2.0$. The mixed estimator of the average
share would be

$$ \hat{\mu} = \frac{2.0 \times 0.25 + 0.1532}{1 + 2.0} = 0.2177. \tag{12.59} $$

Next, consider an increase in $\phi$ to 3.0.

$$ \hat{\mu} = \frac{3.0 \times 0.25 + 0.1532}{1 + 3.0} = 0.2258. \tag{12.60} $$

Finally, consider reducing $\phi$ to 1.0.

$$ \hat{\mu} = \frac{1.0 \times 0.25 + 0.1532}{1 + 1.0} = 0.2016. \tag{12.61} $$

As $\phi$ increases, more weight is put on the prior. The smaller the $\phi$, the closer
the estimate is to the sample value.

Next, consider using the first two data points so that $\bar{y} = 0.1576$. Assuming
$\phi = 2.0$,

$$ \hat{\mu} = \frac{\dfrac{2.0}{2} \times 0.25 + 0.1576}{1 + \dfrac{2.0}{2}} = 0.2038. \tag{12.62} $$

Notice that this estimate is closer to the sample average than for the same $\phi$
with one observation. Expanding to the full sample, $\phi = 2.0$ and $\bar{y} = 0.1549$
yields

$$ \hat{\mu} = \frac{\dfrac{2.0}{10} \times 0.25 + 0.1549}{1 + \dfrac{2.0}{10}} = 0.1708. \tag{12.63} $$

Finally, notice that as $n \to \infty$ the weight on the prior vanishes so that the
estimate converges to the sample average.
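
The arithmetic in Equations 12.59 through 12.63 can be checked with a few lines of code. The following Python sketch is an illustration only: the function name `mixed_estimate` and its argument names are hypothetical, and the data are the capital shares from Table 12.3.

```python
def mixed_estimate(phi, prior_mean, sample_mean, n=1):
    """Mixed estimate of the mean in the form of Equation 12.58."""
    weight = phi / n  # the weight on the prior shrinks as n grows
    return (weight * prior_mean + sample_mean) / (1.0 + weight)


# Capital shares for 1960 through 1969 (Table 12.3)
shares = [0.1532, 0.1620, 0.1548, 0.1551, 0.1343,
          0.1534, 0.1587, 0.1501, 0.1459, 0.1818]

# One observation with three different values of phi
one_obs = [mixed_estimate(phi, 0.25, shares[0]) for phi in (2.0, 3.0, 1.0)]

# Two observations and the full sample, both with phi = 2.0
two_obs = mixed_estimate(2.0, 0.25, sum(shares[:2]) / 2, n=2)
full = mixed_estimate(2.0, 0.25, sum(shares) / len(shares), n=len(shares))

print([round(v, 4) for v in one_obs], round(two_obs, 4), round(full, 4))
```

Writing the weight as `phi / n` makes the limiting behavior visible: as `n` grows the weight on the prior goes to zero and the function returns the sample mean.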


12.2.3. Simple Estimation by Simulation 


Consider the small sample on the share of capital cost for production agriculture
from the KLEM data. Suppose we assume a prior normal distribution
with a mean of $\tilde{\mu} = 0.25$ and a coefficient of variation of 1.25. The variance
would then be $(0.25 \times 1.25)^2 = 0.097656$. The posterior distribution would
then be

$$ f[\alpha \mid y] \propto \frac{1}{\sqrt{2\pi \times 0.096757}} \exp\left[-\frac{(\alpha - 0.25)^2}{2 \times 0.096757}\right] \times \left(\frac{1}{\sqrt{2\pi\sigma_y^2}}\right)^{10} \exp\left[-\frac{(0.1532-\alpha)^2 + (0.1620-\alpha)^2 + \cdots + (0.1818-\alpha)^2}{2\sigma_y^2}\right]. \tag{12.64} $$

The concept is then to estimate $\alpha$ by integrating the probability density function
by simulation. Specifically,

$$ \hat{\alpha}_{Bayes} = \int \alpha f[\alpha \mid y]\, d\alpha. \tag{12.65} $$

As a starting point, assume that we draw a value of $\alpha$ based on the prior
distribution. For example, in the first row of Table 12.4 we draw $\alpha = 0.8881$.
Given this draw, we compute the likelihood function of the sample.

$$ L(\alpha \mid y) \propto \exp\left[-\frac{(0.1532-0.8881)^2 + (0.1620-0.8881)^2 + \cdots + (0.1818-0.8881)^2}{2 \times 0.096757}\right] = 8.5594\mathrm{E}{-13}. \tag{12.66} $$

TABLE 12.4
Simulation Share Estimator

Draw     $\alpha$      $L(\alpha \mid y)$     $\alpha \times L(\alpha \mid y)$
 1     0.8881     8.5594E-13     7.6018E-13
 2     0.1610     9.9130E-01     1.5960E-01
 3     0.1159     9.1806E-01     1.0642E-01
 4     0.3177     2.5264E-01     8.0261E-02
 5     0.0058     3.1454E-01     1.8129E-03
 6    -0.0174     2.1401E-01    -3.7269E-03
 7     0.2817     4.3307E-01     1.2198E-01
 8     0.9582     3.2764E-15     3.1395E-15
 9     0.3789     7.4286E-02     2.8149E-02
10    -0.1751     3.5680E-03    -6.2479E-04
11    -0.0477     1.1906E-01    -5.6766E-03
12     0.2405     6.8021E-01     1.6360E-01
13     0.2461     6.4644E-01     1.5908E-01
14     0.2991     3.3906E-01     1.0143E-01
15     0.7665     3.9991E-09     3.0655E-09
16     0.0952     8.2591E-01     7.8617E-02
17    -0.0060     2.6062E-01    -1.5565E-03
18     0.2745     4.7428E-01     1.3020E-01
19     0.6151     1.7537E-05     1.0787E-05
20     0.3079     2.9626E-01     9.1228E-02


Summing over the 20 observations,

$$ \hat{\alpha}_{Bayes} = \frac{\sum_{i=1}^{20} \alpha_i L[\alpha_i \mid y]}{\sum_{i=1}^{20} L[\alpha_i \mid y]} = 0.1769. \tag{12.67} $$

Expanding the sample to 200 draws, $\hat{\alpha}_{Bayes} = 0.1601$.
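
The likelihood-weighted average in Equation 12.67 can be reproduced from the draws listed in Table 12.4. The Python sketch below is illustrative rather than the code behind the table: the function names are hypothetical, the variance 0.096757 is the value that appears in Equation 12.66, and the draws are copied to the four decimals printed, so the result agrees with the table only up to rounding. The final lines draw a fresh sample of 200 values from the prior; the seed is an arbitrary assumption, so that estimate will not match the 200-draw figure reported above.

```python
import math
import random

# Capital shares for 1960 through 1969 (Table 12.3)
shares = [0.1532, 0.1620, 0.1548, 0.1551, 0.1343,
          0.1534, 0.1587, 0.1501, 0.1459, 0.1818]
PRIOR_MEAN = 0.25
VARIANCE = 0.096757  # value used in Equation 12.66


def likelihood(alpha, y=shares, variance=VARIANCE):
    """Kernel of the normal likelihood of the sample given a draw alpha."""
    sse = sum((yi - alpha) ** 2 for yi in y)
    return math.exp(-sse / (2.0 * variance))


def bayes_estimate(draws):
    """Likelihood-weighted average of prior draws, as in Equation 12.67."""
    weights = [likelihood(a) for a in draws]
    return sum(a * w for a, w in zip(draws, weights)) / sum(weights)


# The 20 draws printed in Table 12.4
table_draws = [0.8881, 0.1610, 0.1159, 0.3177, 0.0058, -0.0174, 0.2817,
               0.9582, 0.3789, -0.1751, -0.0477, 0.2405, 0.2461, 0.2991,
               0.7665, 0.0952, -0.0060, 0.2745, 0.6151, 0.3079]
table_estimate = bayes_estimate(table_draws)

# A fresh set of 200 draws from the prior (arbitrary seed, assumption)
rng = random.Random(0)
new_draws = [rng.gauss(PRIOR_MEAN, math.sqrt(VARIANCE)) for _ in range(200)]
new_estimate = bayes_estimate(new_draws)

print(round(table_estimate, 4), round(new_estimate, 4))
```

Draws far from the sample values, such as 0.8881, receive weights that are essentially zero, so the estimate is dominated by the draws near the data.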


12.3. Least Absolute Deviation and Related Estimators 


With the exception of some of the maximum likelihood formulations, we have
typically applied a squared error weighting throughout this textbook.

$$ \epsilon_i = y_i - X_i\beta, \qquad L(\beta) = \frac{1}{N}\sum_{i=1}^{N} \rho(\epsilon_i) \tag{12.68} $$

where $\rho(\epsilon_i) = \epsilon_i^2$ under squared error weighting.

FIGURE 12.3
Alternative Residual Functions. [Plot of $\rho(\epsilon)$ against $\epsilon$ from $-2.5$ to $2.5$ for the squared, absolute, and Huber functions.]


However, the squared error choice is to some extent arbitrary. Figure 12.3
presents three different residual functions: the standard residual squared
function, the absolute value function, and a Huber weighting function [19].
The absolute value function assumes

$$ \rho(\epsilon_i) = |\epsilon_i| \tag{12.69} $$

while the Huber function is

$$ \rho(\epsilon_i) = \begin{cases} \frac{1}{2}\epsilon_i^2 & \text{for } |\epsilon_i| \le k \\ k|\epsilon_i| - \frac{1}{2}k^2 & \text{for } |\epsilon_i| > k \end{cases} \tag{12.70} $$

with $k = 1.5$ (taken from Fox [12]). Each of these functions has implications
for the weight given observations farther away from the middle of the distribution.
In addition, each estimator has slightly different consequences for
the “middle” of the distribution. As developed throughout this textbook, the
“middle” for Equation 12.68 is the mean of the distribution while the “middle”
for Equation 12.69 yields the median ($\delta$).

$$ \delta \ \text{s.t.} \ F(\delta) = 1 - F(\delta) = 0.50 \tag{12.71} $$

where $F(\cdot)$ is the cumulative density function of $\epsilon_i$. The “middle” of the
Huber function is somewhat flexible, leading to a “robust” estimator.
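
The claim that each residual function implies a different “middle” can be illustrated numerically. The Python sketch below is a minimal illustration under stated assumptions: the function names, the five-point sample with one outlier, and the grid search are all hypothetical choices, not taken from the text. It codes Equations 12.69 and 12.70 along with the squared residual, and finds the location that minimizes each.

```python
def rho_squared(e):
    """Squared residual."""
    return e * e


def rho_absolute(e):
    """Absolute value residual (Equation 12.69)."""
    return abs(e)


def rho_huber(e, k=1.5):
    """Huber residual function (Equation 12.70)."""
    return 0.5 * e * e if abs(e) <= k else k * abs(e) - 0.5 * k * k


def argmin_location(rho, data, lo, hi, steps=10001):
    """Grid search for the location m that minimizes sum(rho(y - m))."""
    grid = [lo + (hi - lo) * j / (steps - 1) for j in range(steps)]
    return min(grid, key=lambda m: sum(rho(y - m) for y in data))


# Hypothetical sample with one large outlier
sample = [1.0, 2.0, 3.0, 5.0, 20.0]

m_squared = argmin_location(rho_squared, sample, 0.0, 20.0)
m_absolute = argmin_location(rho_absolute, sample, 0.0, 20.0)
m_huber = argmin_location(rho_huber, sample, 0.0, 20.0)

print(m_squared, m_absolute, m_huber)
```

For this sample the squared function is minimized at the mean (6.2), the absolute value function at the median (3.0), and the Huber function at a point slightly above the median, since it treats small residuals like squared errors and large residuals like absolute errors.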
12.3.1 Least Absolute Deviation


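Following the absolute value formulation of $\rho(\epsilon_i)$ in Equation 12.69, we can
formulate the Least Absolute Deviation Estimator (LAD) as

$$ \min_{\beta} \frac{1}{N}\sum_{i=1}^{N} \rho\left(y_i - \beta_0 - \beta_1 x_i\right) = \frac{1}{N}\sum_{i=1}^{N} \left|y_i - \beta_0 - \beta_1 x_i\right| \tag{12.72} $$

where $y_i$ are observed values for the dependent variable, $x_i$ are observed values
of the independent variable, and $\beta$ is the vector of parameters to be estimated. To
develop the concept of the different weighting structures, consider the first-order
conditions for the general formulation in Equation 12.72.

$$ \begin{aligned} \frac{\partial L}{\partial \beta_0} &= \frac{1}{N}\sum_{i=1}^{N} \rho'\left(y_i - \beta_0 - \beta_1 x_i\right)(-1) \\ \frac{\partial L}{\partial \beta_1} &= \frac{1}{N}\sum_{i=1}^{N} \rho'\left(y_i - \beta_0 - \beta_1 x_i\right)(-x_i). \end{aligned} \tag{12.73} $$

If $\rho(\epsilon_i)$ is the standard squared error, Equation 12.73 yields the standard set
of normal equations.

$$ \frac{\partial \rho\left(y_i - \beta_0 - \beta_1 x_i\right)}{\partial \beta_1} = -2\left(y_i - \beta_0 - \beta_1 x_i\right) x_i. \tag{12.74} $$

However, if $\rho(\epsilon_i)$ is the absolute value, the derivative becomes

$$ \frac{\partial \rho\left(y_i - \beta_0 - \beta_1 x_i\right)}{\partial \beta_1} = \begin{cases} x_i & \text{for } y_i < \beta_0 + \beta_1 x_i \\ -x_i & \text{for } y_i > \beta_0 + \beta_1 x_i \end{cases} \tag{12.75} $$

which cannot be solved using the standard calculus (i.e., assuming a smooth
derivative).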
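
Because the derivative in Equation 12.75 is not smooth, LAD estimates are computed numerically. The Python sketch below uses iteratively reweighted least squares (IRLS), which is one common approximation and is not a method described in this text; LAD can also be solved exactly as a linear program. The function names, the floor `eps` on the residuals, and the toy data (a line $y = 1 + 2x$ with one contaminated observation) are all illustrative assumptions.

```python
def wls_line(x, y, w):
    """Weighted least squares fit of y = b0 + b1*x (2x2 normal equations)."""
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b1 = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    b0 = (sy - b1 * sx) / sw
    return b0, b1


def lad_line(x, y, iters=50, eps=1e-8):
    """Approximate LAD fit by IRLS with weights 1/|residual|."""
    w = [1.0] * len(x)  # the first pass is ordinary least squares
    for _ in range(iters):
        b0, b1 = wls_line(x, y, w)
        # |e| = e^2 / |e|, so weighting squared errors by 1/|e| mimics LAD
        w = [1.0 / max(abs(yi - b0 - b1 * xi), eps) for xi, yi in zip(x, y)]
    return b0, b1


# Hypothetical data: an exact line with one outlier at x = 4
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi for xi in x]
y[4] += 30.0

b0_ols, b1_ols = wls_line(x, y, [1.0] * len(x))
b0_lad, b1_lad = lad_line(x, y)
print(round(b0_ols, 3), round(b1_ols, 3), round(b0_lad, 3), round(b1_lad, 3))
```

In this example the single outlier pulls the least squares intercept well away from one, while the reweighted fit returns to the line through the nine uncontaminated points.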
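Bassett and Koenker [3] develop the asymptotic distribution of the parameters
as

$$ \sqrt{N}\left(\beta_N^{*} - \beta\right) \xrightarrow{d} N\left(0, \omega^2 \left[X'X\right]^{-1}\right) \tag{12.76} $$

where $\omega^2$ is an asymptotic estimator of the variance (i.e., $\omega^2 = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - X_i\beta\right)^2$).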

In order to demonstrate the applications of the LAD estimator, consider
the effect of gasoline and corn prices on ethanol.

$$ p_{et} = \beta_0 + \beta_1 p_{gt} + \beta_2 p_{ct} + \epsilon_t \tag{12.77} $$

where $p_{et}$ is the price of ethanol at time $t$, $p_{gt}$ is the price of gasoline at time
$t$, $p_{ct}$ is the price of corn at time $t$, $\epsilon_t$ is the residual, and $\beta_0$, $\beta_1$, and $\beta_2$
are the parameters we want to estimate. Essentially, the question is whether
gasoline or corn prices determine the price of ethanol. The data for 1982
through 2013 are presented in Table 12.5. The parameter estimates using
OLS, LAD, quantile regression with $\tau = 0.50$ discussed in the next section,
and two different Huber weighting functions are presented in Table 12.6.




TABLE 12.5 
Ethanol, Gasoline, and Corn Prices 1982-2013 
Nominal Prices Real Prices 
Year Ethanol Gasoline Corn PCE Ethanol Gasoline Corn 
1982 1.71 1.00 2.55 50.479 3.631 2.123 5.415 
1983 1.68 0.91 3.21 52.653 3.420 1.853 6.535 
1984 1.55 0.85 2.63 54.645 3.040 1.667 5.159 
1985 1.60 0.85 2.23 56.581 3.031 1.610 4.225 
1986 1.07 0.51 1.50 57.805 1.984 0.946 2.781 
1987 1.21 0.57 1.94 59.649 2.174 1.024 3.486 
1988 1.13 0.54 2.54 61.973 1.954 0.934 4.393 
1989 1.23 0.61 2.36 64.640 2.040 1.012 3.913 
1990 1.35 0.75 2.28 67.439 2.146 1.192 3.624 
1991 1.27 0.69 2.37 69.651 1.954 1.062 3.647 
1992 1.33 0.64 2.07 71.493 1.994 0.960 3.103 
1993 1.16 0.59 2.50 73.278 1.697 0.863 3.657 
1994 1.19 0.56 2.26 74.802 1.705 0.802 3.238 
1995 1.15 0.59 3.24 76.354 1.614 0.828 4.548 
1996 1.35 0.69 2.71 77.980 1.856 0.948 3.725 
1997 1.15 0.55 2.43 79.326 1.554 0.743 3.283 
1998 1.05 0.43 1.94 79.934 1.408 0.577 2.601 
1999 0.98 0.59 1.82 81.109 1.295 0.780 2.405 
2000 1.35 0.93 1.85 83.128 1.741 1.199 2.385 
2001 1.48 0.88 1.97 84.731 1.872 1.113 2.492 
2002 1.12 0.81 2.32 85.872 1.398 1.011 2.896 
2003 1.35 0.98 2.42 87.573 1.652 1.199 2.962 
2004 1.69 1.25 2.06 89.703 2.019 1.494 2.462 
2005 1.80 1.66 2.00 92.260 2.091 1.929 2.324 
2006 2.58 1.94 3.04 94.728 2.919 2.195 3.440 
2007 2.24 2.23 4.20 97.099 2.473 2.462 4.636 
2008 2.47 2.57 4.06 100.063 2.646 2.753 4.349 
2009 1.79 1.76 3.55 100.000 1.919 1.886 3.805 
2010 1.93 2.17 5.18 101.654 2.035 2.288 5.462 
2011 2.70 2.90 6.22 104.086 2.780 2.986 6.405 
2012 2.37 2.95 6.89 106.009 2.396 2.983 6.967 
2013 2.47 2.90 4.50 107.187 2.470 2.900 4.500 


12.3.2 Quantile Regression 


The least absolute deviation estimator in Equation 12.72 provides a transition
to the Quantile Regression estimator. Specifically, following Koenker and
Bassett [27] we can rewrite the estimator in Equation 12.72 as

$$ \min_{\beta}\left[\sum_{i \in \{i:\, y_i \ge x_i\beta\}} \theta\left|y_i - x_i\beta\right| + \sum_{i \in \{i:\, y_i < x_i\beta\}} (1-\theta)\left|y_i - x_i\beta\right|\right] \tag{12.78} $$

TABLE 12.6
Least Absolute Deviation Estimates of Ethanol Price

                                              M-Robust Estimator
             OLS       LAD      Q($\tau$=0.50)     1.25       1.5
$\beta_0$        0.952     1.254     1.254        1.045     1.028
            (0.245)   (0.245)                (0.220)   (0.216)
$\beta_1$        0.308     0.325     0.325        0.338     0.335
            (0.136)   (0.135)                (0.114)   (0.114)
$\beta_2$        0.189     0.090     0.090        0.137     0.145
            (0.079)   (0.079)                (0.073)   (0.072)


TABLE 12.7
Quantile Regression Estimates for Ethanol Prices

                       Quantile Regression ($\tau$)
Parameter         0.2       0.4       0.6       0.8       OLS       LAD
$\beta_0$            1.1085    1.1729    1.1481    0.4874    0.9519    1.2537
  Lower Bound    0.6278    0.5184    0.1691   -0.0264    0.4504    0.7526
  Upper Bound    1.2898    1.5508    1.5706    0.8447    1.4535    1.7548
$\beta_1$            0.3248    0.3962    0.3112    0.6733    0.3084    0.3254
  Lower Bound   -0.1656    0.2543    0.0840    0.1721    0.0313    0.0485
  Upper Bound    0.5374    0.5180    0.8321    1.3640    0.5855    0.6023
$\beta_2$            0.0520    0.0662    0.1475    0.2772    0.1886    0.0902
  Lower Bound   -0.0469   -0.0523    0.0435    0.2413    0.0266   -0.0717
  Upper Bound    0.2925    0.4022    0.3903    0.4312    0.3506    0.2521


where $\theta = 0.5$. Intuitively, the first sum in Equation 12.78 corresponds to
the observations where $y_i \ge x_i\beta$ or the observation is above the regression
relationship, while the second sum corresponds to the observations where
$y_i < x_i\beta$ or the observation is below the regression relationship.

Generalizing this relationship slightly,

$$ Q(\tau) = \min_{\beta}\left[\sum_{i \in \{i:\, y_i \ge x_i\beta\}} \tau\left|y_i - x_i\beta\right| + \sum_{i \in \{i:\, y_i < x_i\beta\}} (1-\tau)\left|y_i - x_i\beta\right|\right] \Rightarrow \beta(\tau) \tag{12.79} $$

for any $\tau \in (0,1)$.
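
To see how the asymmetric weights in Equation 12.79 pick out a quantile, consider the simplest case of an intercept-only model, where $x_i\beta$ reduces to a single constant. The Python sketch below is an illustration under that assumption; the function names, the grid search, and the sample of integers 1 through 11 are hypothetical.

```python
def check_loss(b, y, tau):
    """Objective in Equation 12.79 for an intercept-only model."""
    total = 0.0
    for yi in y:
        e = yi - b
        # observations on or above b get weight tau, those below get 1 - tau
        total += tau * abs(e) if e >= 0 else (1.0 - tau) * abs(e)
    return total


def argmin_quantile(y, tau, steps=1001):
    """Grid search for the constant that minimizes the check loss."""
    lo, hi = min(y), max(y)
    grid = [lo + (hi - lo) * j / (steps - 1) for j in range(steps)]
    return min(grid, key=lambda b: check_loss(b, y, tau))


# Hypothetical sample: the integers 1 through 11
sample = [float(v) for v in range(1, 12)]

q20 = argmin_quantile(sample, 0.2)
q50 = argmin_quantile(sample, 0.5)
q80 = argmin_quantile(sample, 0.8)
print(q20, q50, q80)

# With tau = 0.5 the objective is one half of the sum of absolute deviations
half_lad = 0.5 * sum(abs(v - 4.2) for v in sample)
same_as_lad = abs(check_loss(4.2, sample, 0.5) - half_lad) < 1e-12
```

For this sample the minimizers are the third, sixth, and ninth ordered observations. The last two lines confirm that with $\tau = 0.5$ the objective is proportional to the LAD objective, so both are minimized at the same point.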

Turning to the effect of the price of gasoline and corn on ethanol prices
reported in Table 12.7, the effect of gasoline on ethanol prices appears to
be the same for quantiles 0.2 through 0.6. However, the effect of gasoline on
ethanol prices increases significantly at the 0.8 quantile. On the other hand,
the regression coefficients for the price of corn on ethanol prices increase rather
steadily throughout the entire sample range.

TABLE 12.8
Quantile Regression on Factors Affecting Farmland Values

                                       Selected Quantiles
Coefficient             20           40           50           70           90
Constant             AL.71***     1172.27***   1395.03***   1868.91***   3164.32***
                     (64.78)^b    (56.44)      (56.57)      (139.98)     (258.88)
Cash Income/          0.004        0.07**       0.10***      0.25***      0.64***
  Owned Acre         (0.010)      (0.03)       (0.027)      (0.043)      (0.126)
CRP/                  0.08        -0.78        -1.06*       -1.59***     -2.89***
  Owned Acre         (0.325)      (0.492)      (0.603)      (0.355)      (1.119)
Direct Payment/      -0.06        -.34*         0.46***      1.00***      1.85***
  Owned Acre         (0.084)      (0.212)      (0.167)      (0.24)       (0.94)
Indirect Payment/     0.01         0.006        0.07**       0.28**       0.74**
  Owned Acre         (0.058)      (0.162)      (0.030)      (0.090)      (0.34)
Off-Farm Income/      0.10***      0.24***      0.39***      0.86***      2.12***
  Owned Acre         (0.021)      (0.033)      (0.042)      (0.080)      (0.190)
^a ***, **, and * denote statistical significance at the 0.01, 0.05, and
0.10 levels of confidence, respectively. ^b Numbers in parentheses denote
standard errors.


Quantile regressions are sometimes used to examine the distributional effect
of an economic policy. Mishra and Moss [31] estimate the effect of off-farm
income on farmland values using a quantile regression approach. Table 12.8
presents some of the regression estimates for selected quantiles. These results
indicate that the amount that each household is willing to pay for farmland
increases with the quantile of the regression. In addition, the effect of government
payments on farmland values increases with the quantile.


12.4 Chapter Summary 


• One of the factors contributing to the popularity of linear econometric
models such as those presented in Chapter 11 is their simplicity. As long
as the independent variables are not linearly dependent, $\hat{\beta} = (X'X)^{-1}X'y$
exists and is relatively simple to compute.

• The advances in computer technology and algorithms have increased the
use of nonlinear estimation techniques. These techniques largely involve
iterative optimization algorithms.

• Nonlinear least squares allows for flexible specifications of functions and
distributions (e.g., the Cobb-Douglas production function can be estimated
without assuming that the residuals are log-normal).

• There are similarities between nonlinear maximum likelihood and nonlinear
least squares. One of the differences is the derivation of the variance
matrix for the parameters.

• The development of the Gibbs sampler and other simulation techniques
provides for the estimation of a variety of priors and sampling probabilities.
Specifically, we are no longer bound to conjugate families with simple
closed form solutions.

• Least absolute deviation models provide one alternative to the traditional
concept of minimizing the squared error of the residual.

• The least absolute deviation is a special case of the quantile regression
formulation.

• Another weighting of the residual is the M-robust estimator proposed by
Huber [19].


12.5 Review Questions 


12-1R. What information is required to estimate either nonlinear least squares 
or maximum likelihood using Newton—Raphson? 


12-2R. How do the least absolute deviation and M-robust estimators reduce 
the effect of outliers? 


12.6 Numerical Exercises 


12-1E. Estimate the Cobb-Douglas production function for three inputs given 
the data in Table 12.1 using nonlinear least squares. Compare the re- 
sults to the linear transformation using the same data. Are the results 
close? 


12-2E. Using the 1960 through 1965 data for the interest rate paid by Al- 
abama farmers in Appendix A, construct a Bayesian estimator for 
the average interest rate given the prior distribution for the mean is 
N (0.05, 0.005). 


12-3E. Given the setup in Exercise 12-2E, use simulation to derive an empir- 
ical Bayesian estimate of the average interest rate. 


12-4E. Estimate the effect of the market interest rate ($R_t$) and changes in the
debt to asset level ($\Delta(D/A)_t$) on the interest rate paid by farmers in
South Carolina ($r_t$),

$$ r_t = \alpha_0 + \alpha_1 R_t + \alpha_2 \Delta(D/A)_t + \epsilon_t \tag{12.80} $$

using least absolute deviation and ordinary least squares. How different
are the results?


13 


Conclusions 


As I stated in Chapter 1, one of the biggest problems in econometrics is that 
students have a tendency to learn statistics and econometrics as a set of tools. 
Typically, they do not see the unifying themes involved in quantifying sample 
information to make inferences. To introduce this concept, Chapter 1 starts 
by reviewing how we think of science in general and economics as a science. 
In addition to the standard concepts of economics as a science, Chapter 1 
also introduces econometrics as a policy tool and framework for economic 
decisions. 

Chapter 2 develops the concept of probabilities, starting with a brief dis- 
cussion of the history of probability. I have attempted to maintain these de- 
bates in the text, introducing not only the dichotomy between classical and 
Bayesian ideas about probability, but also introducing the development of 
Huygens, Savage and de Finetti, and Kolmogorov. 

Chapter 3 then builds on the concepts of probability introduced in Chap- 
ter 2. Chapter 3 develops probability with a brief introduction to the concept 
of measure theory. I do not dwell on it, but I have seen significant discussions 
of measure theory in econometric literature — typically in the time series lit- 
erature. After this somewhat abstract introduction, Chapter 3 proceeds with 
a fairly standard development of probability density functions. The chapter 
includes a fairly rigorous development of the normal distribution — including 
the development of trigonometric transformations in Appendix C. 

Chapter 4 presents the implications of the probability measures developed 
in Chapter 3 on the moments of the distribution. Most students identify with 
the first and second moments of distribution because of their prominence 
in introductory statistical classes. Beginning with the anomaly that the first 
moment (i.e., the mean) need not exist for some distributions such as the 
Cauchy distribution, this chapter develops the notion of boundedness of an 
expectation. Chapter 4 also introduces the notion of sample, population, and 
theoretical moments. In our discussion of the sample versus population mo- 
ments, we then develop the cross-moment of covariance and its normalized 
version, the correlation coefficient. The covariance coefficient then provides 
our initial development of the least squares estimator. 

The development of the binomial and normal distributions is then presented
in Chapter 5. In this chapter, we demonstrate how the binomial distribution
converges to the normal as the sample size increases. After linking
these distributions, Chapter 5 generalizes the univariate normal distribution
to first the bivariate and then the more general multivariate normal. The ex- 
tension of the univariate normal to the bivariate normal is used to develop 
the role of the correlation coefficient. 

Given these foundations, the next part of the textbook develops the con- 
cepts of estimation. Chapter 6 presents the concept of large samples based 
on the notion of convergence. In its simplest form, the various modes of con- 
vergence give us a basis for saying that sample parameters will approach (or 
converge to) the true population parameters. Convergence is important for
a variety of reasons, but in econometrics we are often interested in the convergence
properties of ordinary least squares estimates. In general, Chapter
6 demonstrates that ordinary least squares coefficients converge to their true 
values as the sample size becomes large. In addition, Chapter 6 develops the 
central limit theorem, which states that the linear estimators such as ordinary 
least squares estimators are asymptotically distributed normal. 

Chapter 7 focuses on the development of point estimators — the estimation 
of parameters of distributions or functions of parameters of distribution. As a 
starting point, the chapter introduces the concept of the sample as an image 
of the population. Given that the sample is the image of the population, we 
can then use the estimates of the sample parameters to make inferences about 
the population parameters. As a starting point, Chapter 7 considers some 
familiar estimators such as the sample mean and variance. Given these esti- 
mators, we introduce the concept of a measure of the goodness of an estimator. 
Specifically, Chapter 7 highlights that an estimator is a random variable that 
provides a value for some statistic (i.e., the value of some parameter of the 
distribution function). Given that the estimator is a random variable with a 
distribution, we need to construct a measure of closeness to describe how well 
the estimator represents the sample. Chapter 7 makes extensive use of two 
such measures — mean squared error of the estimator and the likelihood of the 
sample. Using these criteria, the chapter develops the point estimators. 

Given that most estimators are continuous variables, the probability that 
any estimate is correct is actually zero. As a result, most empirical studies 
present confidence intervals (i.e., ranges of values that are hypothesized to 
contain the true parameter value with some level of statistical confidence). 
Chapter 8 develops the mathematics of confidence intervals. 

Chapter 9 presents the mathematical basis for testing statistical hypothe- 
ses. The dominant themes in this chapter include the development of Type I 
and Type II errors and the relative power of hypothesis tests. In addition to 
the traditional frequentist approach to hypothesis testing, Chapter 9 presents 
an overview of the Neyman—Pearson Lemma, which provides for the selection 
of an interval of rejection with different combinations of Type I and Type II 
error. 

The third part of the textbook shifts from a general development of math- 
ematical statistics to focus on applications particularly popular in economics. 
To facilitate our development of these statistical methodologies, Chapter 10
provides a brief primer on matrix analysis. First, the chapter reviews the basic
matrix operations such as matrix addition, multiplication, and inversion.
Using these basic notions, Chapter 10 turns to the definition of a vector space 
spanned by a linear system of equations. This development allows for the 
general development of a projection matrix which maps points from a more 
general space into a subspace. Basically, ordinary least squares is simply one 
formulation for such a projection. 

With this background, Chapter 11 develops the typical linear models ap- 
plied in econometrics. As a starting point, we prove that ordinary least squares 
is a best linear unbiased estimator of the typical linear relationship under ho- 
moscedasticity (i.e., the assumption that all the errors are independently and 
identically distributed). After this basic development, we generalize the model 
by first allowing the residuals to be heteroscedastic (i.e., to have different variances
or to be correlated across observations). Given
the possibility of heteroscedasticity, we develop the generalized least squares 
estimator. Next, we allow for the possibility that some of the regressors are 
endogenous (i.e., one of the “independent variables” is simultaneously deter- 
mined by the level of a dependent variable). To overcome endogeneity, we de- 
velop Theil’s two stage least squares and the instrumental variable approach. 
Finally, Chapter 11 develops the linear form of the generalized method of 
moments estimator. 

The textbook presents three nonlinear econometric techniques in Chap- 
ter 12. First, nonlinear least squares and numerical applications of maximum 
likelihood are developed. These extensions allow for more general formulations 
with fewer restrictions on the distribution of the residual. Next, the chapter 
develops a simple Bayesian estimation procedure using simulation to inte- 
grate the prior distribution times the sampling distribution. Simulation has 
significantly expanded the variety of models that can be practically estimated 
using Bayesian techniques. Finally, Chapter 12 examines the possibility of 
nonsmooth error functions such as the least absolute deviation estimator and 
quantile regression. The least absolute deviation estimator is a robust estima- 
tor — less sensitive to extreme outliers. On the other hand, quantile regression 
allows the researcher to examine the distributional differences in the results. 

In summation, this textbook has attempted to provide “bread crumbs” for 
the student to follow in developing a theoretical understanding of econometric 
applications. Basically, why are the estimated parameters from a regression 
distributed Student’s t or normal? What does it mean to develop a confidence 
interval? And, why is this the rejection region for a test? It is my hope that 
after reading this textbook, students will be at least partially liberated from 
the cookbook approach to econometrics. If they do not know the why for a 
specific result, at least they will have a general reason. 

There are some things that this book is not. It is not a stand-alone book 
on econometrics. Chapters 11 and 12 barely touch on the vast array of econo- 
metric techniques in current use. However, the basic concepts in this book 
form a foundation for serious study of this array of techniques. For example,
maximum likelihood is the basis for most models of limited dependent variables
where only two choices are possible. Similarly, the section on matrix
algebra and the mathematics of projection matrices under normality form the
basis for many time series techniques such as Johansen’s [21] error correction
model. This book is intended as an introductory text to provide a foundation
for further econometric adventures.

Beginning in the late 1980s, several computer programs emerged to solve symbolic
and numeric representations of algebraic expressions. In this appendix,
we briefly introduce two such computer programs: Maxima and Mathematica.


A.1 Maxima


Maxima is an open-source program for symbolic analysis. For example, suppose
we are interested in solving for the zeros of

$$ f(x) = 8 - \frac{(x-4)^2}{2} \tag{A.1} $$

and then plotting the function. To accomplish this we write a batch (ASCII)
file:


/*************************************************************/
/* Setup the simple quadratic function */
f(x) := 8-(x-4)^2/2;

/* Solve for those points where f(x) = 0 */
solve(f(x)=0,x);

/* Plot the simple function */
plot2d(f(x),[x,0,8]);
/*************************************************************/


FIGURE A.1
Maxima Plot of Simple Quadratic Function.


Executing this batch file yields: 


(%i2) f(x):=8-(x-4)^2/2
(%o2) f(x):=8-(x-4)^2/2
(%i3) solve(f(x) = 0,x)
(%o3) [x=0,x=8]
(%i4) plot2d(f(x),[x,0,8])


and the graphic output presented in Figure A.1. The output (%o3) presents
the solutions to $f(x) = 0$ as $x = 0$ and $x = 8$.

Next, consider using Maxima to compute the mean, variance, skewness, 
and kurtosis of the simple quadratic distribution. The Maxima code for these 
operations becomes: 


/*************************************************************/
/* Integrating f(x) for 0 to 8 */
f(x) := 8-(x-4)^2/2;
k: integrate(f(x),x,0,8);

/* Setup a valid probability density function */
g(x) := f(x)/k;

/* Verify that the new function integrates to one */
integrate(g(x),x,0,8);

/* Compute the expected value of the distribution */
avg: integrate(x*g(x),x,0,8);

/* Compute the variance of the distribution */
var: integrate((x-avg)^2*g(x),x,0,8);

/* Compute the skewness of the distribution */
skw: integrate((x-avg)^3*g(x),x,0,8);

/* Compute the kurtosis of the distribution */
krt: integrate((x-avg)^4*g(x),x,0,8);
/*************************************************************/
The output for these computations is then: 


(%i2) f(x):=8-(x-4)^2/2
(%o2) f(x):=8-(x-4)^2/2
(%i3) k:integrate(f(x),x,0,8)
(%o3) 128/3
(%i4) g(x):=f(x)/k
(%o4) g(x):=f(x)/k
(%i5) integrate(g(x),x,0,8)
(%o5) 1
(%i6) avg:integrate(x*g(x),x,0,8)
(%o6) 4
(%i7) var:integrate((x-avg)^2*g(x),x,0,8)
(%o7) 16/5
(%i8) skw:integrate((x-avg)^3*g(x),x,0,8)
(%o8) 0
(%i9) krt:integrate((x-avg)^4*g(x),x,0,8)
(%o9) 768/35


From this output we see that the constant of integration is $k = 128/3$. The
valid probability density function for the quadratic is then

$$ f(x) = \frac{3}{128}\left(8 - \frac{(x-4)^2}{2}\right). \tag{A.2} $$

As demonstrated in the Maxima output, $\int_0^8 f(x)\,dx = 1$ so $f(x)$ is a valid
probability density function. Next, we develop the Maxima code to derive the
cumulative density function:


/*************************************************************/
/* Starting with the valid probability density function */
f(x) := 3/128*(8-(x-4)^2/2);

/* We derive the general form of the cumulative density */
/* function */
r1: integrate(f(z),z,0,x);

/* We can then plot the cumulative density function */
plot2d(r1,[x,0,8]);

/* Next, assume that we want to compute the value of the */
/* cumulative density function at x = 3.5 */
subst(x=3.5,r1);
/*************************************************************/
The result of this code is then: 


(%i2) f(x):=3*(8-(x-4)^2/2)/128
(%o2) f(x):=(3*(8-(x-4)^2/2))/128
(%i3) r1:integrate(f(z),z,0,x)
(%o3) -(x^3-12*x^2)/256
(%i4) plot2d(r1,[x,0,8])
(%o4)
(%i5) subst(x = 3.5,r1)
(%o5) 0.40673828125


with the plot depicted in Figure A.2. 
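
The values reported by Maxima can be cross-checked without a symbolic package, because every integrand here is a polynomial. The Python sketch below is an independent check, not part of the Maxima session; the helper names `poly_mul`, `poly_pow`, and `integrate_poly` are hypothetical, and polynomials are stored as coefficient lists in increasing powers of $x$. Expanding Equation A.1 gives $f(x) = 4x - x^2/2$.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_pow(p, m):
    """Polynomial p raised to the non-negative integer power m."""
    out = [Fraction(1)]
    for _ in range(m):
        out = poly_mul(out, p)
    return out


def integrate_poly(p, a, b):
    """Exact definite integral of the polynomial p from a to b."""
    a, b = Fraction(a), Fraction(b)
    return sum(Fraction(c) * (b ** (j + 1) - a ** (j + 1)) / (j + 1)
               for j, c in enumerate(p))


f = [Fraction(0), Fraction(4), Fraction(-1, 2)]  # 8 - (x - 4)^2 / 2
k = integrate_poly(f, 0, 8)                       # normalizing constant
g = [c / k for c in f]                            # valid density on [0, 8]

avg = integrate_poly(poly_mul([Fraction(0), Fraction(1)], g), 0, 8)
dev = [-avg, Fraction(1)]                         # the polynomial (x - avg)
var = integrate_poly(poly_mul(poly_pow(dev, 2), g), 0, 8)
skw = integrate_poly(poly_mul(poly_pow(dev, 3), g), 0, 8)
krt = integrate_poly(poly_mul(poly_pow(dev, 4), g), 0, 8)
cdf_35 = integrate_poly(g, 0, Fraction(7, 2))     # cumulative value at 3.5

print(k, avg, var, skw, krt, float(cdf_35))
```

Using `Fraction` keeps the arithmetic exact, so the results can be compared directly with the rational numbers in the Maxima output. As in the Maxima code, `skw` and `krt` are the third and fourth central moments rather than standardized coefficients.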


A.2. Mathematica™ 


Mathematica is a proprietary program from Wolfram Research. To use Mathematica
efficiently, the user writes a Notebook file using Mathematica’s frontend
program. In this section, we will present some of the commands and
responses from Mathematica. We will not discuss the frontend program in
detail.


FIGURE A.2
Maxima Cumulative Distribution Function for Quadratic Distribution.

As a starting point, consider the same set of operations from our Maxima 
example. We start by defining the quadratic function 


f[x_] := 8 - (x - 4)^2/2;
Print[f[x]];


Mathematica responds with the general form of the function (i.e., from the
Print command):


8-1/2 (-4+x)^2


The input command and resulting output for solving for the zeros of the 
quadratic function is then: 


sol1 = Solve[f[x] == 0, x];
Print[sol1];


{{x->0},{x->8}}

We can then generate the plot of the quadratic function:

Print[Plot[f[x], {x, 0, 8}]];


The plot for this input is presented in Figure A.3. Following the Maxima
example, we integrate the quadratic function over its entire range to derive
the normalization factor:


FIGURE A.3
Mathematica Plot of Simple Quadratic Function.


k = Integrate[f[x], {x, 0, 8}];
Print[k];
g[x_] := f[x]/k;
Print[g[x]];


128/3
3/128 (8-1/2 (-4+x)^2)

Next we test the constant of integration and compute the average, variance, 
skewness, and kurtosis: 


tst = Integrate[g[x], {x, 0, 8}];
Print[tst];
1


avg = Integrate[x g[x], {x, 0, 8}];
Print[avg];
4


var = Integrate[(x - avg)^2 g[x], {x, 0, 8}];
Print[var];
16/5


skw = Integrate[(x - avg)^3 g[x], {x, 0, 8}];
Print[skw];
0


krt = Integrate[(x - avg)^4 g[x], {x, 0, 8}];
Print[krt];
768/35


FIGURE A.4
Mathematica Cumulative Distribution Function for Quadratic Distribution.


To finish the example, we derive the cumulative distribution function by
integrating the probability density function. The graph of the cumulative
distribution function is shown in Figure A.4.


h[x_] := Integrate[g[z], {z, 0, x}];
Print[h[x]];


(3 x^2)/64-x^3/256
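The Mathematica results can be reproduced with exact polynomial integration. The sketch below is in Python rather than Mathematica, and all helper names (`poly_mul`, `integrate`, `central_moment`) are hypothetical. It assumes only that the quadratic expands to f(x) = 4x - x^2/2, stored as a coefficient list with the lowest power first.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest power first)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def integrate(p, lo, hi):
    """Exact definite integral of polynomial p between lo and hi."""
    def antiderivative(x):
        return sum(c * Fraction(x) ** (n + 1) / (n + 1) for n, c in enumerate(p))
    return antiderivative(hi) - antiderivative(lo)


def central_moment(density, mean, order, lo, hi):
    """Integral of (x - mean)^order times the density over [lo, hi]."""
    deviation = [Fraction(1)]
    for _ in range(order):
        deviation = poly_mul(deviation, [-mean, Fraction(1)])
    return integrate(poly_mul(deviation, density), lo, hi)


# f(x) = 8 - (x - 4)^2 / 2 = 4x - x^2 / 2
f = [Fraction(0), Fraction(4), Fraction(-1, 2)]
k = integrate(f, 0, 8)                      # normalization factor
g = [c / k for c in f]                      # probability density function
avg = integrate(poly_mul([Fraction(0), Fraction(1)], g), 0, 8)
var = central_moment(g, avg, 2, 0, 8)
skw = central_moment(g, avg, 3, 0, 8)
krt = central_moment(g, avg, 4, 0, 8)
print(k, avg, var, skw, krt)
```

Exact fractions are used so that the output can be compared digit for digit with the symbolic results above.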


Appendix B 


Change of Variables for Simultaneous 
Equations 


CONTENTS 
B.1 Linear Change in Variables
B.2 Estimating a System of Equations


One of the more vexing problems in econometrics involves the estimation of
simultaneous equations. Consider a simple model of supply and demand,

$$
\begin{aligned}
q_S &= \alpha_{10} + \alpha_{11} p + \alpha_{12} p^A + \alpha_{13} w + \epsilon_1 \\
q_D &= \alpha_{20} + \alpha_{21} p + \alpha_{24} p^B + \alpha_{25} Y + \epsilon_2
\end{aligned}
\tag{B.1}
$$

where $q_S$ is the quantity supplied, $q_D$ is the quantity demanded, $p$ is
the price, $p^A$ is the price of an alternative good to be considered in the
production decision (i.e., another good that the firm could produce using its
resources), $p^B$ is the price of an alternative good in the consumption
equation (i.e., a complement or substitute in consumption), $w$ is the price
of an input, $Y$ is consumer income, $\epsilon_1$ is the error in the supply
relationship, and $\epsilon_2$ is the error in the demand relationship. In
estimating these relationships, we impose the market clearing condition that
$q_S = q_D$. As developed in Section 11.4.2, a primary problem with this
formulation is the correlation between the price in each equation and the
respective residual. To demonstrate this correlation we solve the supply and
demand relationships in Equation B.1 for the price to yield

$$
p = \frac{-\alpha_{10} - \alpha_{12} p^A - \alpha_{13} w + \alpha_{20} + \alpha_{24} p^B + \alpha_{25} Y - \epsilon_1 + \epsilon_2}{\alpha_{11} - \alpha_{21}}.
\tag{B.2}
$$

Because the price in Equation B.2 depends on both errors, it is apparent that
$E[p\epsilon_1], E[p\epsilon_2] \neq 0$. Thus, the traditional least squares
estimator cannot be unbiased or asymptotically consistent. The simultaneous
equations bias will not go away as the sample grows. Chapter 11 presents two
least squares remedies for this problem: two stage least squares and
instrumental variables. However, in this appendix we develop the likelihood
for this simultaneous problem to develop the Full Information Maximum
Likelihood (FIML) estimator.
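The correlation between the equilibrium price and the two errors can be illustrated by simulation. The sketch below is in Python (not the R used elsewhere in this appendix) and uses the coefficient values from the empirical version of the model in Section B.2. The distributions of the exogenous variables and the unit-variance errors are assumptions made purely for illustration, and the names `simulate_cov` and `covariance` are hypothetical.

```python
import random


def covariance(u, v):
    """Sample covariance (divisor n) of two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def simulate_cov(n=20000, seed=1):
    """Draw equilibrium prices from Equation B.2 and return cov(p, e1), cov(p, e2)."""
    rng = random.Random(seed)
    a10, a11, a12, a13 = 115.0, 6.0, -2.5, -12.0
    a20, a21, a24, a25 = 115.5, -10.0, 3.3333, 0.00095
    prices, e1s, e2s = [], [], []
    for _ in range(n):
        # Assumed (illustrative) distributions for the exogenous variables.
        pa, w = rng.gauss(6.0, 0.7), rng.gauss(2.5, 0.5)
        pb, y = rng.gauss(7.3, 0.7), rng.gauss(1000.0, 10.0)
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        p = (-a10 - a12 * pa - a13 * w + a20 + a24 * pb + a25 * y
             - e1 + e2) / (a11 - a21)
        prices.append(p)
        e1s.append(e1)
        e2s.append(e2)
    return covariance(prices, e1s), covariance(prices, e2s)


cov_p_e1, cov_p_e2 = simulate_cov()
print(cov_p_e1, cov_p_e2)
```

With unit-variance errors, Equation B.2 implies covariances of -1/16 and +1/16 (the error variance divided by the denominator of 16), so the simulated values should be close to -0.0625 and 0.0625 rather than zero.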
B.1 Linear Change in Variables


To develop the FIML estimator we return to the linear change in variables
notion developed in Theorem 3.36. The transformation envisioned in Chapter
3 is

$$
\begin{aligned}
Y_1 &= a_{11} X_1 + a_{12} X_2 \\
Y_2 &= a_{21} X_1 + a_{22} X_2
\end{aligned}
\quad \Rightarrow \quad
\begin{aligned}
X_1 &= b_{11} Y_1 + b_{12} Y_2 \\
X_2 &= b_{21} Y_1 + b_{22} Y_2.
\end{aligned}
\tag{B.3}
$$

Given this formulation, the contention was Equation 3.94:

$$
g(y_1, y_2) = \frac{f(b_{11} y_1 + b_{12} y_2,\; b_{21} y_1 + b_{22} y_2)}{|a_{11} a_{22} - a_{12} a_{21}|}.
$$

The real question is then, what does this mean? The overall concept is that I
know (or at least hypothesize that I know) the distribution of $X_1$ and $X_2$
and want to use it to derive the distribution of $Y_1$ and $Y_2$, which are
functions of $X_1$ and $X_2$.

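Consider a simple example of the mapping in Equation B.3:

$$
\begin{aligned}
Y_1 &= X_1 + \tfrac{1}{2} X_2 \\
Y_2 &= \tfrac{3}{4} X_2
\end{aligned}
\tag{B.4}
$$

such that $f(X_1, X_2) = 1$ for $X_1, X_2 \in [0, 1]$. Applying Equation 3.94,
the new probability density function (i.e., in terms of $Y_1$ and $Y_2$) is
then

$$
g(Y_1, Y_2) = \frac{1}{\left|1 \times \tfrac{3}{4}\right|} = \frac{4}{3}.
\tag{B.5}
$$

The problem is that the range of $Y_1$ and $Y_2$ (depicted in Figure B.1) is
not $Y_1, Y_2 \in [0, 1]$. Specifically $Y_2 \in [0, \tfrac{3}{4}]$, but the
range of $Y_1$ is determined by the linear relationship
$Y_1 = X_1 + \tfrac{2}{3} Y_2$. The lower edge of this range is a line with an
intercept at $Y_1 = 0$ and a slope of $\tfrac{2}{3}$ in $Y_2$. In order to
develop the ranges, consider the solution of Equation B.4 in terms of $X_1$
and $X_2$:

$$
\begin{aligned}
X_1 &= Y_1 - \tfrac{2}{3} Y_2 \\
X_2 &= \tfrac{4}{3} Y_2.
\end{aligned}
\tag{B.6}
$$

First we solve for the bounds of $Y_1$:

$$
X_1 = Y_1 - \tfrac{2}{3} Y_2 \text{ and } X_1 \in [0, 1] \Rightarrow Y_1 \in \left[\tfrac{2}{3} Y_2,\; 1 + \tfrac{2}{3} Y_2\right].
\tag{B.7}
$$

If you want to concentrate on the "inside integral,"

$$
\int_{\frac{2}{3} Y_2}^{1 + \frac{2}{3} Y_2} \frac{4}{3}\, dY_1 = \frac{4}{3}\left[\left(1 + \tfrac{2}{3} Y_2\right) - \tfrac{2}{3} Y_2\right] = \frac{4}{3}.
\tag{B.8}
$$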
FIGURE B.1
Transformed Range of $Y_1$ and $Y_2$.


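The "outside integral" is then determined by

$$
Y_2 = \tfrac{3}{4} X_2 \text{ and } X_2 \in [0, 1] \Rightarrow Y_2 \in \left[0, \tfrac{3}{4}\right].
\tag{B.9}
$$

Integrating over the result from Equation B.8 then gives

$$
\int_0^{\frac{3}{4}} \frac{4}{3}\, dY_2 = \frac{4}{3}\left[\frac{3}{4} - 0\right] = 1.
\tag{B.10}
$$

The complete form of the integral is then

$$
\int_0^{\frac{3}{4}} \int_{\frac{2}{3} Y_2}^{1 + \frac{2}{3} Y_2} \frac{4}{3}\, dY_1\, dY_2 = 1.
\tag{B.11}
$$

The implication of this discussion is simply that the probability function is
valid.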
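The pieces of this example (the determinant, the inverse mapping, the transformed density, and the check that it integrates to one) can be verified with exact arithmetic. The sketch below is in Python, with hypothetical names throughout. It takes the mapping Y1 = X1 + X2/2, Y2 = 3 X2/4 on the unit square as given and derives the bounds of Y1 from the inverse mapping.

```python
from fractions import Fraction as F

# Forward mapping Y = A X from Equation B.4.
A = [[F(1), F(1, 2)],
     [F(0), F(3, 4)]]

det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]

# Inverse mapping X = B Y (the b coefficients in Equation B.3).
B = [[A[1][1] / det_A, -A[0][1] / det_A],
     [-A[1][0] / det_A, A[0][0] / det_A]]

# Uniform density f(X1, X2) = 1 divided by |det A|, as in Equation 3.94.
density = 1 / abs(det_A)


def y1_bounds(y2):
    """Bounds of Y1 implied by X1 = Y1 + B[0][1] * Y2 lying in [0, 1]."""
    shift = -B[0][1] * y2
    return shift, 1 + shift


# Inner integral over Y1 (constant in Y2), then outer integral over Y2.
lo, hi = y1_bounds(F(1, 2))
inner = density * (hi - lo)
y2_max = A[1][1] * 1          # Y2 = (3/4) X2 with X2 in [0, 1]
total = inner * y2_max
print(det_A, density, inner, total)
```

The width of the Y1 interval is one for every Y2, which is why the inner integral does not depend on Y2 and the total mass is (4/3)(3/4) = 1.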

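Next, consider the scenario where $f(X_1, X_2)$ is a correlated normal
distribution:

$$
f(X_1, X_2 | \mu, \Sigma) = (2\pi)^{-1} |\Sigma|^{-1/2} \exp\left[-\frac{1}{2}\left(\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} - \mu\right)' \Sigma^{-1} \left(\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} - \mu\right)\right].
\tag{B.12}
$$

In order to simplify our discussion, let $\mu = 0$ and $\Sigma = I_2$. This
simplifies the distribution function in Equation B.12 to

$$
f(X_1, X_2) = (2\pi)^{-1} \exp\left[-\frac{1}{2} \begin{bmatrix} X_1 & X_2 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\right].
\tag{B.13}
$$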
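FIGURE B.2
Transformed Normal Distribution.


Next, consider rewriting $f(X_1, X_2)$ (in Equation B.13) as $g(Y_1, Y_2)$.
Following Equation 3.94 yields

$$
g(Y_1, Y_2) = \frac{1}{\left|1 \times \tfrac{3}{4}\right|} (2\pi)^{-1} \exp\left[-\frac{1}{2} \begin{bmatrix} Y_1 - \tfrac{2}{3} Y_2 & \tfrac{4}{3} Y_2 \end{bmatrix} \begin{bmatrix} Y_1 - \tfrac{2}{3} Y_2 \\ \tfrac{4}{3} Y_2 \end{bmatrix}\right]
\tag{B.14}
$$

which is presented graphically in Figure B.2.
Next, we redevelop the results in matrix terms.

$$
\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} 1 & -\tfrac{2}{3} \\ 0 & \tfrac{4}{3} \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} \Rightarrow X = \Gamma Y.
\tag{B.15}
$$

Substituting Equation B.15 into Equation B.14 yields

$$
\begin{aligned}
g(Y_1, Y_2) &= \frac{1}{\left|1 \times \tfrac{3}{4}\right|} (2\pi)^{-1} \exp\left[-\frac{1}{2} \left(\begin{bmatrix} 1 & -\tfrac{2}{3} \\ 0 & \tfrac{4}{3} \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}\right)' \left(\begin{bmatrix} 1 & -\tfrac{2}{3} \\ 0 & \tfrac{4}{3} \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}\right)\right] \\
&= \frac{1}{\left|1 \times \tfrac{3}{4}\right|} (2\pi)^{-1} \exp\left[-\frac{1}{2} \begin{bmatrix} Y_1 & Y_2 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -\tfrac{2}{3} & \tfrac{4}{3} \end{bmatrix} \begin{bmatrix} 1 & -\tfrac{2}{3} \\ 0 & \tfrac{4}{3} \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}\right].
\end{aligned}
\tag{B.16}
$$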
Using the $X = \Gamma Y$ form in Equation B.15, we can rewrite Equation B.16 as

$$
g(Y_1, Y_2) = (2\pi)^{-1} \exp\left[-\frac{1}{2} Y' \Gamma' \Gamma Y\right] |\Gamma|.
\tag{B.17}
$$

Going back to the slightly more general formulation,

$$
g(Y_1, Y_2) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left[-\frac{1}{2} Y' \Gamma' \Sigma^{-1} \Gamma Y\right] |\Gamma|
\tag{B.18}
$$

where $p$ is the number of variables in the multivariate distribution.
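As a numerical check on the transformed normal density, the sketch below (Python, hypothetical names) evaluates the density in the X = ΓY form with the matrix Γ implied by the example mapping Y1 = X1 + X2/2, Y2 = 3 X2/4, assuming a standard bivariate normal for X. It then confirms by a simple midpoint Riemann sum that the transformed density still integrates to approximately one. The grid limits and cell counts are arbitrary choices for illustration.

```python
import math

# Gamma maps Y back to X for the example mapping (inverse of the forward map).
GAMMA = [[1.0, -2.0 / 3.0],
         [0.0, 4.0 / 3.0]]
DET_GAMMA = GAMMA[0][0] * GAMMA[1][1] - GAMMA[0][1] * GAMMA[1][0]


def std_normal_pdf(x1, x2):
    """Bivariate normal density with zero mean and identity covariance."""
    return math.exp(-0.5 * (x1 * x1 + x2 * x2)) / (2.0 * math.pi)


def g(y1, y2):
    """Transformed density: f(Gamma Y) times |det Gamma|."""
    x1 = GAMMA[0][0] * y1 + GAMMA[0][1] * y2
    x2 = GAMMA[1][0] * y1 + GAMMA[1][1] * y2
    return std_normal_pdf(x1, x2) * abs(DET_GAMMA)


def riemann_total(cells1=360, cells2=180):
    """Midpoint rule for g over Y1 in [-9, 9] and Y2 in [-4.5, 4.5]."""
    h1, h2 = 18.0 / cells1, 9.0 / cells2
    total = 0.0
    for i in range(cells1):
        y1 = -9.0 + (i + 0.5) * h1
        for j in range(cells2):
            y2 = -4.5 + (j + 0.5) * h2
            total += g(y1, y2)
    return total * h1 * h2


print(DET_GAMMA, riemann_total())
```

The determinant of Γ is 4/3, the reciprocal of the determinant 3/4 of the forward mapping, which is why multiplying by |Γ| and dividing by |1 × 3/4| are the same operation.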


B.2 Estimating a System of Equations 
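Next, we return to an empirical version of Equation B.1.

$$
\begin{aligned}
q_S &= 115.0 + 6.0 p - 2.5 p^A - 12.0 w + \epsilon_1 \\
q_D &= 115.5 - 10.0 p + 3.3333 p^B + 0.00095 Y + \epsilon_2.
\end{aligned}
\tag{B.19}
$$

We can write Equation B.19 in the general form

$$
\Gamma Y = A X + \epsilon \Rightarrow Y - \Gamma^{-1} A X = \Gamma^{-1} \epsilon
\tag{B.20}
$$

as

$$
\begin{bmatrix} 1 & -6.0 \\ 1 & 10.0 \end{bmatrix} \begin{bmatrix} q \\ p \end{bmatrix} = \begin{bmatrix} 115.0 & -2.5 & -12.0 & 0.0 & 0.0 \\ 115.5 & 0.0 & 0.0 & 3.3333 & 0.00095 \end{bmatrix} \begin{bmatrix} 1 \\ p^A \\ w \\ p^B \\ Y \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \end{bmatrix}.
\tag{B.21}
$$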


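Using this specification, we return to the general likelihood function implied
by Equation B.18.

$$
f(Y) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left[-\frac{1}{2} \left(Y - \Gamma^{-1} A X\right)' \Gamma' \Sigma^{-1} \Gamma \left(Y - \Gamma^{-1} A X\right)\right] \left|\frac{\partial \epsilon}{\partial Y}\right|.
\tag{B.22}
$$

Since $|\partial \epsilon / \partial Y| = |\Gamma|$, Equation B.22 can be
rewritten as

$$
f(Y) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left[-\frac{1}{2} \left(Y - \Gamma^{-1} A X\right)' \Gamma' \Sigma^{-1} \Gamma \left(Y - \Gamma^{-1} A X\right)\right] |\Gamma|.
\tag{B.23}
$$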
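Consider the simulated data for Equation B.21 presented in Table B.1.
We estimate this system of equations using an iterative technique similar to
those presented in Appendix E. Specifically, we use a computer algorithm that
starts with an estimated set of parameters and then attempts to compute a
set of parameters closer to the optimal. As an additional point of complexity,
Equation B.23 includes a set of variance parameters ($\Sigma$). In our case the
$\Sigma$ matrix involves three additional parameters ($\sigma_{11}$,
$\sigma_{12}$, and $\sigma_{22}$). While we could search over these
parameters, it is simpler to concentrate these parameters out of the
distribution function. Notice that these parameters are the variance
coefficients for the untransformed model (i.e., the model as presented in
Equation B.19). Hence, we can derive the estimated variance matrix for each
value of the parameters as

$$
\begin{aligned}
\epsilon(\alpha) &= \begin{bmatrix} q - \alpha_{10} - \alpha_{11} p - \alpha_{12} p^A - \alpha_{13} w & \; q - \alpha_{20} - \alpha_{21} p - \alpha_{24} p^B - \alpha_{25} Y \end{bmatrix} \\
\hat{\Sigma}(\alpha) &= \frac{1}{N} \epsilon(\alpha)' \epsilon(\alpha).
\end{aligned}
\tag{B.24}
$$

Further simplifying the optimization problem by taking the natural logarithm
of Equation B.23 (summed over the $N$ observations) and discarding the
constant yields

$$
\ln L(\alpha) \propto -\frac{N}{2} \ln\left|\hat{\Sigma}(\alpha)\right| - \frac{1}{2} \sum_{i=1}^{N} \left(Y_i - \Gamma^{-1} A X_i\right)' \Gamma' \hat{\Sigma}(\alpha)^{-1} \Gamma \left(Y_i - \Gamma^{-1} A X_i\right) + N \ln|\Gamma|
\tag{B.25}
$$

with

$$
\begin{aligned}
\alpha &= \{\alpha_{10}, \alpha_{11}, \alpha_{12}, \alpha_{13}, \alpha_{20}, \alpha_{21}, \alpha_{24}, \alpha_{25}\} \\
\Gamma &= \begin{bmatrix} 1 & -\alpha_{11} \\ 1 & -\alpha_{21} \end{bmatrix} \\
A &= \begin{bmatrix} \alpha_{10} & \alpha_{12} & \alpha_{13} & 0 & 0 \\ \alpha_{20} & 0 & 0 & \alpha_{24} & \alpha_{25} \end{bmatrix}.
\end{aligned}
$$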
TABLE B.1
Data Set for Full Information Maximum Likelihood
Obs. Q p p^A w p^B Y
1 101.013 5.3181 5.7843 2.6074 8.7388 1007.20 
2 91.385 5.6342 6.1437 3.5244 6.8113 1005.17 
3 103.315 4.6487 6.4616 1.9669 7.4628 1001.18 
4 105.344 4.1317 6.1897 1.6157 6.5012 1002.51 
5 98.677 4.8446 6.2839 2.3121 6.5719 1006.60 
6 106.015 4.6119 5.9365 1.7852 8.1629 990.01 
7 100.507 4.8681 5.2442 2.6392 7.2785 1001.22 
8 97.746 5.1389 5.5195 2.8337 7.2162 1003.83 
9 104.609 4.5759 4.9144 2.1443 7.5860 1011.73 
10 100.274 5.2478 6.4346 2.4840 8.3292 999.61
11 105.011 4.3364 4.8738 1.9210 7.0101 997.99
12 99.311 5.0234 5.9073 2.5028 7.3492 998.36 
13 100.1383 4.9524 5.6071 2.5546 7.4004 1001.29
14 99.163 4.8527 7.0464 2.3681 6.8731 986.34 
15 100.986 4.9196 6.0364 2.3506 7.5375 1001.05 
16 97.317 5.5198 6.9735 2.84380 8.2804 999.71 
17 101.919 4.5613 5.2070 2.3069 6.7528 999.88 
18 98.571 5.2390 6.1843 2.7258 7.7439 1014.08 
19 97.835 5.8442 5.8252 3.2269 9.4427 979.63 
20 93.062 5.3741 4.5755 3.4441 6.4937 1004.44 
21 96.886 5.5546 5.5188 3.1334 8.1915 1004.27 
22 102.866 4.7970 6.0927 2.0457 7.7434 992.51 
23 101.721 4.7528 5.1983 2.4157 7.2560 1007.71 
24 98.433 5.3295 6.4392 2.5720 8.0082 992.39 
25 95.441 5.3400 6.1446 2.9980 7.1166 1021.44 
26 101.268 4.7685 5.5490 2.3263 7.1841 996.66
27 100.197 5.1127 6.9770 2.2818 7.8262 1016.34 
28 96.613 4.6348 6.9509 2.3346 5.3826 991.89 
29 97.385 4.9461 6.7019 2.5715 6.5967 986.52 
30 103.915 4.6871 6.3564 1.9521 7.7442 1001.89 
31 95.602 5.1170 6.4526 2.8207 6.5441 990.76 
32 102.136 4.7981 7.0438 2.0735 7.5421 1005.92 
33 93.416 5.4138 6.6783 3.1374 6.7559 1001.24 
34 95.716 5.4084 6.3358 2.9980 7.4294 998.78 
35 94.905 5.0263 6.0993 3.0018 6.0674 1004.53 
36 99.439 4.7399 6.2888 2.3126 6.5385 1005.63 
37 110.629 3.9795 5.7451 1.1964 7.6438 1003.48
38 100.805 4.5947 5.6510 2.3125 6.4995 1008.28 
39 100.951 5.1130 5.0394 2.6601 8.1279 1006.28 
40 104.540 4.4065 5.7951 1.8974 7.0803 994.11 
41 103.274 4.8168 6.5948 2.1816 7.9596 1006.27 
42 94.083 5.2631 5.1958 3.3148 6.5127 996.84 
43 99.577 5.3378 6.0798 2.7206 8.4150 992.78 
44 96.338 4.9589 5.9935 2.8168 6.2990 990.47 
45 104.351 4.7474 5.3707 2.1632 8.0089 1011.81 
46 100.210 4.9229 6.6945 2.3060 7.3371 994.40 
47 94.072 5.7720 5.5735 3.4364 8.0373 995.79 
48 100.460 5.1441 5.38778 2.7249 8.1262 989.14 
49 99.129 5.3715 6.4842 2.6675 8.3774 997.82 
50 102.551 4.8758 6.2919 2.1630 7.8877 997.69 
51 97.505 4.7343 6.3408 2.5109 5.9771 989.90 
52 97.932 5.2163 6.2582 2.8204 7.5332 1010.31 
53 100.669 5.2442 6.6273 2.4942 8.4434 1012.06 
54 102.016 5.1210 4.8340 2.6142 8.4602 999.90 
55 102.327 4.6689 6.6954 1.9703 7.1832 1007.31 
56 96.758 5.3178 6.8145 2.7339 7.4476 1002.68 
57 100.947 5.1529 5.8795 2.5656 8.2317 1007.66 
58 93.441 5.7598 5.1249 3.6586 7.8207 1006.79 
59 99.241 5.1567 6.9333 2.4682 7.7528 996.58 
60 98.562 4.6966 6.7658 2.2488 6.1362 1008.28 
61 109.635 4.3500 5.6216 1.4074 8.4442 998.47 
62 97.033 5.2723 6.8122 2.7122 7.3950 1014.60 
63 98.975 4.7467 6.1962 2.3677 6.4070 1001.00 
64 101.968 4.8730 5.3569 2.5012 7.7700 995.71 
65 104.448 4.5595 6.2224 1.8166 7.4865 998.85 
66 101.025 5.1205 6.8176 2.3072 8.1799 993.81 
67 104.936 4.3629 5.5058 1.9391 7.0954 1007.82 
68 106.865 4.0540 5.5562 1.5484 6.7380 993.36 
69 97.024 4.9168 5.4087 2.8418 6.3121 1007.82 
70 97.513 5.1881 6.3041 2.7227 7.2752 1010.94 
71 89.772 5.9486 7.7699 3.4693 7.2697 1003.00 
72 105.279 4.3413 5.2095 1.9837 7.1083 1000.92 
73 106.535 4.1015 6.0280 1.5198 6.7768 994.72 
74 97.998 5.2607 5.7413 2.8433 7.6496 1006.89 
75 101.167 4.8004 6.1460 2.2469 7.2899 983.29 
76 97.666 4.9058 5.7852 2.6019 6.5298 991.44 
77 102.534 4.6082 5.0741 2.3579 7.1192 995.70
78 101.353 4.5451 6.0791 2.1908 6.5430 1001.58 
79 95.475 5.8495 5.4118 3.4785 8.7392 988.47 
80 95.387 5.0495 5.9921 2.9921 6.2858 1000.46 
81 102.550 4.9486 7.3503 2.1588 8.1570 998.02 
82 99.617 5.1257 4.7759 2.9729 7.7619 1003.14 
83 98.207 4.8788 5.9363 2.5928 6.5777 1004.16 
84 107.700 4.3114 6.3258 1.4821 7.7802 1001.08 
85 94.616 5.1705 6.5016 2.8974 6.4460 981.95 
86 96.412 5.3390 5.8664 2.9854 7.5201 980.00 
87 104.131 4.5881 5.6296 2.0724 7.5209 998.92 
88 98.171 4.8662 7.1715 2.3646 6.5489 1003.12 
89 100.330 4.7532 4.3557 2.7301 6.8322 1010.36 
90 102.046 4.5079 6.2594 1.9468 6.6121 997.16 
91 102.058 4.8524 7.2229 2.0469 7.6635 1008.76 
92 101.538 5.1086 6.6352 2.2712 8.2555 1007.75 
93 106.127 4.3825 5.9567 1.6748 7.4946 999.99 
94 96.395 4.9926 6.7225 2.6088 6.4035 988.35 
95 100.459 4.9249 5.8115 2.4600 7.4220 991.86 
96 97.746 5.1175 5.9726 2.5973 7.1797 986.47 
97 96.813 5.0942 6.7325 2.6480 6.8318 995.47 
98 98.187 5.0592 4.8647 2.9029 7.1851 982.71 
99 106.491 4.6461 6.2315 1.7326 8.3240 1014.81 
100 95.846 5.2445 5.8738 2.9156 6.9860 1003.44 
# Read the data from a comma separated variables file

dta <- read.csv("AppendixD.csv")

# Setup the X matrix for each operation (q, constant, p, pA, w, pB, Y)

xx <- as.matrix(cbind(dta[,1],matrix(1,nrow=nrow(dta),1),dta[,2:6]))

# Defining the maximum likelihood function (negative log-likelihood)

ml <- function(b) {
  # Gamma matrix on (q, p) and A matrix on (1, pA, w, pB, Y)
  g <- cbind(rbind(1,1),rbind(-b[2],-b[6]))
  a <- cbind(rbind(b[1],b[5]),rbind(b[3],0),rbind(b[4],0),
             rbind(0,b[7]),rbind(0,b[8]))
  # Structural residuals (Equation B.24) and the concentrated variance matrix
  aa <- cbind(rbind(1,1),-rbind(b[1],b[5]),-rbind(b[2],b[6]),
              -rbind(b[3],0),-rbind(b[4],0),-rbind(0,b[7]),
              -rbind(0,b[8]))
  vv <- var(xx%*%t(aa))
  li <- 0
  for (i in 1:nrow(dta)) {
    # Reduced-form error: Y - solve(Gamma) A X
    err <- rbind(dta[i,1],dta[i,2]) - solve(g)%*%a%*%rbind(1,
           dta[i,3],dta[i,4],dta[i,5],dta[i,6])
    # Accumulate the negative log of Equation B.23 for observation i
    li <- li + 1/2*log(det(vv)) + 1/2*t(err)%*%t(g)%*%solve(vv)%*%
          g%*%err - log(abs(det(g))) }
  return(li) }
TABLE B.2
Full Information Maximum Likelihood Estimates (standard deviations in parentheses)

                                   FIML            FIML
Parameter        True value    (true start)    (OLS start)         OLS
$\alpha_{10}$       115.000        115.677        116.495      116.460
                                  (12.379)       (12.350)      (1.321)
$\alpha_{11}$         6.000          5.320          5.222        5.327
                                   (4.561)        (4.582)      (0.469)
$\alpha_{12}$        -2.500         -2.414         -2.442       -2.392
                                   (0.980)        (0.992)      (0.125)
$\alpha_{13}$       -12.000        -11.507        -11.530      -11.489
                                   (3.655)        (3.670)      (0.378)
$\alpha_{20}$       115.500        115.732        115.602      116.600
                                   (7.363)        (7.358)      (0.793)
$\alpha_{21}$       -10.000         -9.991         -9.981       -9.979
                                   (0.178)        (0.178)      (0.018)
$\alpha_{24}$         3.333          3.331          3.330        3.308
                                   (0.101)        (0.101)      (0.010)
$\alpha_{25}$       0.00095        0.00920        0.00918      0.00848
                                 (0.00535)      (0.00535)    (0.00079)

#####################################################################
# Call for the values used in the simulation as the initial values  #


b0 <- rbind(115,6,-2.5,-12,115.5,-10,3.333,0.00095)
res <- optim(b0,ml,control=list(maxit=3000))

print(res)

h <- optimHess(res$par,ml)


#####################################################################
# Estimate ordinary least squares to use as initial values          #


b1 <- lm(dta[,1] ~ dta[,2] + dta[,3] + dta[,4])

b2 <- lm(dta[,1] ~ dta[,2] + dta[,5] + dta[,6])

b0 <- rbind(as.matrix(b1$coefficients),as.matrix(b2$coefficients))

res <- optim(b0,ml,control=list(maxit=3000))

print(res)

h <- optimHess(res$par,ml)
#####################################################################


Table B.2 presents the true values of the parameters (i.e., those values
used to simulate the data) along with three different sets of estimates. The
first set of estimates (presented in column 3) uses the true values as the
initial parameter values for the iterative techniques. The second set of
estimates in column 4 uses the ordinary least squares estimates in the fifth
column as the initial parameter values.

Each set of parameters is fairly close to the true values, even the ordinary
least squares results. The primary difference between the FIML and ordinary
least squares results is the relatively small standard deviation of the
ordinary least squares parameter estimates. For example, the maximum
likelihood estimate for the standard deviation of $\alpha_{13}$ is 3.670 in
the fourth column compared to a standard deviation of 0.378 under ordinary
least squares. To compare the possible consequences of this difference, we
construct the z values for the difference between the estimated value and the
true value for the FIML and OLS estimates.

$$
z_{FIML} = \frac{|-12.000 + 11.530|}{3.670} = 0.128 < 1.354 = \frac{|-12.000 + 11.489|}{0.378} = z_{OLS}.
\tag{B.26}
$$

Hence, the confidence interval under ordinary least squares is smaller.
However, in this case that may not be a good thing since the estimated
interval is less likely to include the true value because of the simultaneous
equations bias.
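The two z values can be recomputed directly from the rounded entries of Table B.2. The sketch below is in Python with a hypothetical helper `z_value`. It takes the true value of the input price coefficient to be -12.000, as in the first column of the table.

```python
def z_value(estimate, true_value, std_error):
    """Absolute difference between estimate and true value in standard errors."""
    return abs(true_value - estimate) / std_error


# Coefficient on the input price: FIML (OLS start) and OLS columns of Table B.2.
z_fiml = z_value(-11.530, -12.000, 3.670)
z_ols = z_value(-11.489, -12.000, 0.378)
print(round(z_fiml, 3), round(z_ols, 3))
```

With the rounded table entries the OLS ratio comes out at roughly 1.35; the small gap from the value printed in Equation B.26 presumably reflects rounding in the table. Either way the OLS estimate sits about ten times further from the true value, measured in its own standard errors, than the FIML estimate does.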
Section 3.7 integrates the normal distribution by transforming from the
standard Cartesian space ($\{x, y\}$) into polar space ($\{r, \theta\}$). In
this appendix, we develop this transformation in a little more detail.


C.1 Continuing the Example 


We continue with the formulation from Equation 3.103:

$$
y = f(x) = 5 - \frac{x^2}{5}, \quad x \in [-5, 5].
$$

Table C.1 presents the value of $x$ in column 1, the value of $f(x)$ in column
2, and the radius ($r$) in column 3. Given these results, there are two ways
to compute the inscribed angle ($\theta$). The way described in the text
involves solving for the inverse of the tangent function:

$$
\theta = \tan^{-1}\left(\frac{f(x)}{x}\right).
\tag{C.1}
$$

Column 4 of Table C.1 gives the value of $f(x)/x$ while column 5 presents
the inverse tangent value (note that the values are in radians; radians are
typically given as multiples of $\pi$, so for $x = 4.0$ the value of the
angle is $\theta = 0.4229 = 0.1346\pi$, as discussed in the text). The other
method is implied by the cosine result in Equation 3.110:

$$
x = r\cos(\theta) \Rightarrow \theta = \cos^{-1}\left(\frac{x}{r}\right).
\tag{C.2}
$$

Columns 6 and 7 of Table C.1 demonstrate this approach.
Examining the results for $\theta$ in columns 5 and 7 of Table C.1, we see
that the computed values of the inscribed angle are the same up until $x = 0$.
[Figure with two panels: Tangent and Inverse Tangent]

FIGURE C.1
Standard Table for Tangent.

Consider a couple of things about that point. First, $x = 0$ represents the
90° angle: the tangent at this point is positive infinity from the right and
negative infinity from the left. Second, the inverse functions for the tangent
are typically defined on the interval $\theta \in (-\pi/2, \pi/2)$, as
depicted in Figure C.1. Given that we are interested in the range
$\theta \in [0, \pi]$, we transform the inverse tangent in column 8 of
Table C.1 by adding $\pi$ to each value. Given this adjustment, the
$\theta$s based on the tangents are the same as the $\theta$s based on the
cosines. Figure C.2 presents the transformed relationship between
$r(\theta)$ and $\theta$.
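The two ways of computing the inscribed angle can be compared numerically. The sketch below is in Python with hypothetical function names. It assumes the quadratic f(x) = 5 - x^2/5 on [-5, 5] and the standard polar relationship in which the horizontal coordinate is r cos(θ). The adjustment of adding π is applied whenever x is negative, which is where the principal value of the inverse tangent falls outside [0, π].

```python
import math


def f(x):
    """The quadratic from the running example."""
    return 5.0 - x * x / 5.0


def theta_from_tangent(x):
    """Angle from the inverse tangent, shifted by pi for negative x."""
    if x == 0.0:
        return math.pi / 2.0          # vertical ray, tangent is undefined
    theta = math.atan(f(x) / x)
    if x < 0.0:
        theta += math.pi              # move from (-pi/2, 0] into (pi/2, pi]
    return theta


def theta_from_cosine(x):
    """Angle from the inverse cosine of x / r."""
    r = math.hypot(x, f(x))
    return math.acos(x / r)


for x in (4.0, 0.0, -4.0):
    print(x, theta_from_tangent(x), theta_from_cosine(x))
```

The inverse cosine already returns values in [0, π], so it needs no adjustment; this is the reason the two columns agree only after the tangent-based values are shifted.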

Based on the graphical results in Figure C.2, we can make a variety of
approximations to the original function. First, we could use the average value
of the radius ($\bar{r} = 4.6548$). Alternatively, we can regress

$$
r(\theta) = \gamma_0 + \gamma_1 \theta + \gamma_2 \theta^2 + \gamma_3 \theta^3.
\tag{C.3}
$$


Figure C.3 presents each approximation in the original Cartesian space. The 
graph demonstrates that the linear approximation lies slightly beneath the 
cubic approximation. 

Given that we can approximate the function in a polar space, we demonstrate
the integration of the transformed function. Using the result from Equation
3.111,

$$
dy\,dx = r\,dr\,d\theta
$$

we can write the integral of the transformed problem as

$$
\int_0^{\pi} \int_0^{4.6548} r\,dr\,d\theta = 34.0347.
\tag{C.4}
$$

This result can be compared with the integral of the original function of

$$
\int_{-5}^{5} \left(5 - \frac{x^2}{5}\right) dx = \frac{100}{3} \approx 33.3333.
\tag{C.5}
$$

Hence, the approximation is close to the true value of the integral.


FIGURE C.3
Approximation in (x, y) Space.
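Both integrals are simple enough to confirm by hand or in a few lines of code. The sketch below (Python, hypothetical variable names) uses the closed forms: the polar integral with a constant radius is the area of a half disk, and the Cartesian integral follows from the antiderivative 5x - x^3/15.

```python
import math

r_bar = 4.6548

# Integral of r dr dtheta over r in [0, r_bar] and theta in [0, pi].
polar_area = math.pi * r_bar**2 / 2.0


def antiderivative(x):
    """Antiderivative of 5 - x^2 / 5."""
    return 5.0 * x - x**3 / 15.0


exact_area = antiderivative(5.0) - antiderivative(-5.0)
relative_error = (polar_area - exact_area) / exact_area
print(polar_area, exact_area, relative_error)
```

The constant-radius approximation overstates the true area by roughly two percent.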
Obviously the polar integral is much more difficult for the quadratic
function in Equation 3.103. The point of the exercise is that some functions
such as the normal distribution function are more easily integrated in polar
space. If the original function is not integrable, the polar space may yield a
close approximation. The most viable candidates for transformation involve the
sum of squared terms, such as the exponent in the normal distribution.


C.2 Fourier Approximation 


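A related but slightly different formulation involving trigonometric forms is
the Fourier approximation. To develop this approximation, consider the second
order Taylor series expansion that we use to approximate a nonlinear function.

$$
f(x) = f(\tilde{x}) + \begin{bmatrix} \dfrac{\partial f(\tilde{x})}{\partial x_1} & \dfrac{\partial f(\tilde{x})}{\partial x_2} \end{bmatrix} \begin{bmatrix} x_1 - \tilde{x}_1 \\ x_2 - \tilde{x}_2 \end{bmatrix} + \frac{1}{2} \begin{bmatrix} x_1 - \tilde{x}_1 \\ x_2 - \tilde{x}_2 \end{bmatrix}' \begin{bmatrix} \dfrac{\partial^2 f(\tilde{x})}{\partial x_1^2} & \dfrac{\partial^2 f(\tilde{x})}{\partial x_1 \partial x_2} \\ \dfrac{\partial^2 f(\tilde{x})}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f(\tilde{x})}{\partial x_2^2} \end{bmatrix} \begin{bmatrix} x_1 - \tilde{x}_1 \\ x_2 - \tilde{x}_2 \end{bmatrix}.
\tag{C.6}
$$

Letting the first and second derivatives be constants and approximating
around a fixed point ($\tilde{x}$), we derive a standard quadratic
approximation to an unknown function.

$$
f(x) = a_0 + \begin{bmatrix} a_1 \\ a_2 \end{bmatrix}' \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}' \begin{bmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.
\tag{C.7}
$$

We can solve for the coefficients in Equation C.7 that minimize the
approximation error to produce an approximation to any nonlinear function.
However, the simple second order Taylor series expansion may not approximate
some functions well: the true function may be a third or higher order
function.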
An alternative approximation is the Fourier approximation

$$
F(x) = \alpha_0 + \sum_i \left[\alpha_{1i} \cos\left(\frac{x}{k_i}\right) + \alpha_{2i} \sin\left(\frac{x}{k_i}\right)\right]
\tag{C.8}
$$

where $k_i$ is a periodicity. Returning to our example, we hypothesize two
periodicities $k_1 = 5/\pi$ and $k_2 = 5/(2\pi)$. The approximations for the
two periodicities are estimated by minimizing the squared difference between
the function and Equation C.8 and are presented in Figure C.4. Given that the
original function is a quadratic, the second order Taylor series expansion is
exact. However, the Fourier approximation is extremely flexible. A variant of
this approximation is used in Theorem 6.24.


FIGURE C.4
Fourier Approximations (series shown: Original, Approximation 1,
Approximation 2).


Appendix D 


Farm Interest Rate Data 


TABLE D.1 


Farm Interest Rates for Southeastern United States 


                 Alabama             Florida             Georgia      South Carolina        aa
Year  Int Rate    Δ(D/A)  Int Rate    Δ(D/A)  Int Rate    Δ(D/A)  Int Rate    Δ(D/A)      Rate
1960    0.0553    0.1127    0.0511   -0.0588    0.0507    0.0789    0.0595    0.1232    0.0506
1961    0.0547    0.0501    0.0500    0.0771    0.0514    0.0008    0.0584    0.0427    0.0495
1962    0.0551    0.0580    0.0512    0.0864    0.0516    0.1405    0.0584    0.0793    0.0490
1963    0.0546    0.0278    0.0558    0.0651     0.052    0.0286    0.0589    0.0561    0.0474
1964    0.0537    0.0269    0.0586    0.0826    0.0516    0.0462    0.0602    0.0105    0.0472
1965    0.0543    0.0290    0.0509    0.0612     0.054    0.0324    0.0547    0.0378    0.0475
1966    0.0545   -0.0024    0.0517    0.1083    0.0550    0.0339    0.0549    0.0462    0.0551
1967    0.0551    0.0059    0.0542    0.0093    0.0580    0.0113    0.0585   -0.0232    0.0604
1968    0.0573   -0.0299    0.0556    0.0063    0.0603   -0.0362    0.0595   -0.0277    0.0671
1969    0.0601    0.0247    0.0575   -0.0277     0.063    0.0443    0.0628    0.1077    0.0752
1970    0.0609   -0.0486    0.0627   -0.0138    0.0669   -0.0079    0.0681    0.0292    0.0871
1971    0.0614    0.0166    0.0603   -0.0001     0.065   -0.0185    0.0650   -0.0145    0.0822
1972    0.0588    0.0791    0.0592   -0.0273    0.0624    0.0208    0.0647    0.0449    0.0784
1973    0.0630   -0.1264    0.0622   -0.0522    0.0653   -0.0443    0.0631    0.0483    0.0792
1974    0.0695    0.0566    0.0682    0.0987     0.073    0.0577    0.0753    0.0392    0.0907
1975    0.0700    0.0404    0.0716   -0.0057    0.0760    0.1529    0.0794    0.1296    0.1008
1976    0.0718   -0.0079    0.0720   -0.0459    0.0757   -0.0232    0.0750   -0.0006    0.0930
1977    0.0724    0.1041    0.0675    0.0208    0.0716    0.0539    0.0708    0.1041    0.0859
1978    0.0743   -0.0040    0.0706   -0.0417    0.0754   -0.0371    0.0861   -0.0805    0.0906
1979    0.0825   -0.0585    0.0768   -0.0094    0.0817    0.0485    0.0829    0.0075    0.1016
1980    0.0912    0.0273    0.0830   -0.0070    0.0889    0.0720    0.0907    0.0674    0.1281
1981    0.1014    0.1143    0.0877    0.1584    0.1028    0.1242    0.1018    0.1199    0.1488
1982    0.1081    0.0499    0.0946    0.0059    0.1133    0.0158    0.1129    0.0757    0.1494
1983    0.1059    0.0065    0.0950    0.0002    0.1134    0.0103    0.1139    0.0424    0.1271
1984    0.1017    0.0121    0.0900    0.0729    0.1122   -0.0143    0.1113   -0.0145    0.1327
1985    0.0903   -0.0327    0.0942    0.0100    0.1010   -0.0017    0.1030    0.0099    0.1197
1986    0.0928   -0.1189    0.0916   -0.0917    0.1044   -0.1526    0.1008    0.0116    0.0989
1987    0.0943   -0.1231    0.0982   -0.1512    0.0926   -0.0528    0.0961   -0.1878    0.1005
1988    0.0963   -0.1058    0.0971   -0.0111     0.095   -0.1124    0.0978   -0.2189    0.1028
1989    0.0993   -0.0577    0.0990   -0.0733    0.0987   -0.1397    0.0978   -0.1068    0.0969
1990    0.0962   -0.0414    0.0973    0.0214    0.0965   -0.0423    0.0908   -0.0842    0.0985
1991    0.0867   -0.0321    0.0915    0.0014    0.0887    0.0421    0.0848   -0.0705    0.0935
1992    0.0816   -0.0744    0.0842   -0.0003    0.0836   -0.0455    0.0800   -0.0210    0.0860
1993    0.0831   -0.0680    0.0757    0.0470    0.0772   -0.0241    0.0769   -0.0640    0.0763
1994    0.0835   -0.0427    0.0787    0.0366    0.0785   -0.0113    0.0784   -0.0468    0.0827
1995    0.0814    0.0472    0.0810   -0.0303    0.0768    0.0475    0.0775   -0.0131    0.0788
1996    0.0849    0.0292    0.0836    0.0409    0.0836   -0.0147    0.0815    0.0408    0.0775
1997    0.0830    0.0212    0.0789    0.0466    0.0801    0.0029    0.0791   -0.0102    0.0757
1998    0.0805    0.0518    0.0766    0.0512    0.0789   -0.0212    0.0770    0.0020    0.0697
1999    0.0794    0.0000    0.0752    0.0004    0.0790   -0.0909    0.0779    0.0521    0.0758
2000    0.0794    0.0287    0.0756   -0.0053    0.0777   -0.0278    0.0673    0.1619    0.0803
2001    0.0682    0.0169    0.0668   -0.0176    0.0671   -0.0100    0.0603    0.0272    0.0765
2002    0.0629    0.0097    0.0628   -0.0151    0.0619   -0.0265    0.0611   -0.0019    0.0751
2003    0.0509   -0.0229    0.0528   -0.0251    0.0506   -0.0319    0.0498   -0.0248    0.0655
As an introduction to numerical procedures for nonlinear least squares and
maximum likelihood, this appendix presents the solution for the three
parameter Cobb-Douglas production function presented in Section 12.1. Then we
present R code that solves the maximum likelihood formulation of the same
problem. After the maximum likelihood example, the appendix presents a simple
application of Bayesian estimation using simulation. Finally, the appendix
presents the iterative solution to the Least Absolute Deviation problem using
a numerical optimization routine.


E.1 Hessian Matrix of Three-Parameter Cobb—Douglas 


As a starting point, we derive the Hessian matrix (Equation 12.15) for the
three parameter Cobb-Douglas production function. Notice that by Young's
theorem (i.e., $\partial^2 f / \partial x_1 \partial x_2 = \partial^2 f / \partial x_2 \partial x_1$)
we only have to derive the upper triangle part of the Hessian.
(E.1)
Using $\alpha_0 = \{60.00, 0.25, 0.15\}$ as the starting value, the first
gradient vector becomes

$$
\frac{\nabla_{\alpha} L(\alpha)}{100{,}000} = \begin{bmatrix} 1.491 \\ 459.680 \\ 383.507 \end{bmatrix}
\tag{E.2}
$$

where we have normalized by 100,000 as a matter of convenience. The Hessian
matrix for the same point becomes

$$
\frac{\nabla^2_{\alpha\alpha} L(\alpha)}{100{,}000} = \begin{bmatrix} 0.037 & 18.955 & 15.791 \\ 18.955 & 5860.560 & 4874.010 \\ 15.791 & 4874.010 & 4092.800 \end{bmatrix}.
\tag{E.3}
$$

The next point in our sequence is then

$$
\begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \end{bmatrix} = \begin{bmatrix} 60.00 \\ 0.25 \\ 0.15 \end{bmatrix} - \begin{bmatrix} -0.130 \\ 0.053 \\ 0.031 \end{bmatrix} = \begin{bmatrix} 60.130 \\ 0.197 \\ 0.119 \end{bmatrix}.
\tag{E.4}
$$

The first six iterations of this problem are presented in Table E.1. A full
treatment of numeric Newton-Raphson is beyond the scope of this appendix,
but if we accept that the last set of gradients is "close enough to zero," the
estimated Cobb-Douglas production function becomes

$$
f(x_1, x_2) = 62.946\, x_1^{0.160} x_2^{-0.019}.
\tag{E.5}
$$
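The first Newton-Raphson step can be checked for internal consistency. In the sketch below (Python, hypothetical names) the step vector is the one printed in Equation E.4, interpreted as the inverse Hessian times the gradient. Rather than inverting the rounded Hessian, which is sensitive to rounding in its first element, the code multiplies the Hessian by the printed step and compares the result with the gradient from the first row of Table E.1.

```python
def mat_vec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


alpha0 = [60.00, 0.25, 0.15]
gradient = [1.491, 459.680, 383.507]          # scaled by 1/100,000
hessian = [[0.037, 18.955, 15.791],
           [18.955, 5860.560, 4874.010],
           [15.791, 4874.010, 4092.800]]      # scaled by 1/100,000
step = [-0.130, 0.053, 0.031]                 # printed in Equation E.4

# Newton-Raphson update: new point = old point - (inverse Hessian) * gradient.
alpha1 = [a - s for a, s in zip(alpha0, step)]

# If step = H^{-1} g, then H * step should reproduce g up to rounding.
implied_gradient = mat_vec(hessian, step)
relative_gaps = [abs(h - g) / g for h, g in zip(implied_gradient, gradient)]
print(alpha1, implied_gradient, relative_gaps)
```

The implied gradient agrees with the tabulated gradient to within a fraction of a percent, which is about what the three-decimal rounding of the step allows.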
TABLE E.1
Newton-Raphson Iterations

                   Parameters            Function             Gradient
Iteration   $\alpha_0$  $\alpha_1$  $\alpha_2$      Value   $\alpha_0$  $\alpha_1$  $\alpha_2$
    1         60.000     0.250     0.150     3,058,700     1.491    459.680    383.507
    2         60.130     0.197     0.119       816,604     0.508    156.642    130.767
    3         60.259     0.161     0.082       180,003     0.160     49.278     41.197
    4         60.410     0.149     0.038        41,179     0.041     12.733     10.684
    5         60.747     0.157    -0.001        24,492     0.006      1.942      1.645
    6         62.946     0.160    -0.019        24,003     0.000      0.032      0.030

E.2 Bayesian Estimation 


Applications of applied Bayesian econometric techniques have increased
significantly over the past twenty years. As stated in the text, most of this
expansion is due to the development of empirical integration techniques. These
technical advancements include the development of the Gibbs sampler. While
a complete development of these techniques is beyond the scope of the current
text, we develop a small R code to implement the simple example from the
textbook.

The code presented below inputs the data (i.e., the dta <- rbind(...)
command) and then takes 200 random draws from the prior distribution (i.e.,
a <- rnorm(200,mean=0.25,sd=sqrt(0.096757))). Given each of these draws, the
code then computes the likelihood function for the sample and saves the
simulated value, the likelihood value, and the likelihood value times the
random draw (i.e., a[i]*exp(l)). The last line of the program then computes
the Bayesian estimate

$$
\hat{a} = \frac{\sum_{i=1}^{200} a[i] \times L\left[x | a[i]\right]}{\sum_{i=1}^{200} L\left[x | a[i]\right]}.
\tag{E.6}
$$
R Code for Simple Bayesian Estimation 


dta <- rbind(0.1532,0.1620,0.1548,0.1551,0.1343,0.1534,0.1587,
             0.1501,0.1459,0.1818)

# Draw 200 values from the prior distribution
a <- rnorm(200,mean=0.25,sd=sqrt(0.096757))

for (i in 1:200) {
  # Log-likelihood of the sample for draw a[i]
  for (t in 1:10) {
    if (t == 1) l <- 0
    l <- l - ((dta[t]-a[i])^2)/(2*0.096757) }
  # Save the draw, its likelihood, and the draw times its likelihood
  if (i == 1) res <- cbind(a[i],exp(l),a[i]*exp(l)) else
    res <- rbind(res,cbind(a[i],exp(l),a[i]*exp(l))) }

# Likelihood-weighted average of the draws (Equation E.6)
ahat <- sum(res[,3])/sum(res[,2])

print(res)
print(ahat)
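The same likelihood-weighted average can be sketched in Python for readers who want a benchmark for the simulated estimate. The function name `bayes_estimate` and the seed handling are hypothetical; the data, the prior mean, and the variance are those used in the R code above. Because the prior and the likelihood in this code are both normal with the same variance, the exact posterior mean is available in closed form as (prior mean + sum of the data)/(n + 1), which gives a reference value of about 0.1636.

```python
import math
import random

DATA = [0.1532, 0.1620, 0.1548, 0.1551, 0.1343,
        0.1534, 0.1587, 0.1501, 0.1459, 0.1818]
PRIOR_MEAN = 0.25
VARIANCE = 0.096757


def bayes_estimate(n_draws=200, seed=0):
    """Likelihood-weighted average of draws from the prior."""
    rng = random.Random(seed)
    prior_sd = math.sqrt(VARIANCE)
    numerator = 0.0
    denominator = 0.0
    for _ in range(n_draws):
        a = rng.gauss(PRIOR_MEAN, prior_sd)
        log_likelihood = -sum((x - a) ** 2 for x in DATA) / (2.0 * VARIANCE)
        weight = math.exp(log_likelihood)
        numerator += a * weight
        denominator += weight
    return numerator / denominator


# Closed-form posterior mean for this normal-normal setup.
exact_posterior_mean = (PRIOR_MEAN + sum(DATA)) / (len(DATA) + 1)

print(bayes_estimate(200), bayes_estimate(20000), exact_posterior_mean)
```

With only 200 draws the simulated estimate is noisy; increasing the number of draws moves it toward the closed-form value.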


E.3 Least Absolute Deviation Estimator 


Section E.1 presented the overall mechanics of the Newton-Raphson algorithm.
This algorithm is efficient if the Hessian matrix is well behaved. However,
because of the difficulty in computing the analytical Hessian and the possible
instability of the Hessian at some points in the domain of some functions,
many applications use an approximation to the analytical Hessian matrix.
Two of these approximations are the Davidon-Fletcher-Powell (DFP) and the
Broyden-Fletcher-Goldfarb-Shanno (BFGS) updates. These algorithms are usually
classified as quasi-Newton routines (they are sometimes loosely grouped with
conjugate gradient routines): they build up information from successive
gradients to produce an estimate of the Hessian.

These updates start from the first-order Taylor series expansion of the
gradient

$$
\nabla_x f(x_t + s_t) = \nabla_x f(x_t) + \nabla^2_{xx} f(x_t)\, s_t
\tag{E.7}
$$

where $x_t$ is the current point of approximation and $s_t$ is the step or
change in $x_t$ produced by the algorithm. Hence, the change in the gradient
gives information about the Hessian

$$
\nabla_x f(x_t + s_t) - \nabla_x f(x_t) = \nabla^2_{xx} f(x_t)\, s_t.
\tag{E.8}
$$

The information in the Hessian can be derived from

$$
s_t' \left(\nabla_x f(x_t + s_t) - \nabla_x f(x_t)\right) = s_t' \nabla^2_{xx} f(x_t)\, s_t.
\tag{E.9}
$$
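To make the idea concrete, the sketch below (Python, hypothetical names) applies one standard BFGS update to an approximate Hessian and checks that the updated matrix reproduces the observed change in the gradient, i.e., that it satisfies Equation E.8 for the step just taken. The quadratic test function, the starting point, and the step are arbitrary choices for illustration and do not come from the text.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [dot(row, vector) for row in matrix]


def bfgs_update(B, s, y):
    """BFGS update of a symmetric Hessian approximation B.

    B_new = B - (B s)(B s)' / (s' B s) + y y' / (y' s)
    """
    Bs = mat_vec(B, s)
    sBs = dot(s, Bs)
    ys = dot(y, s)
    n = len(s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys
             for j in range(n)] for i in range(n)]


def gradient(x):
    """Gradient of the test function f(x) = x1^2 + x1*x2 + 2*x2^2."""
    return [2.0 * x[0] + x[1], x[0] + 4.0 * x[1]]


x_t = [1.0, 1.0]
s_t = [-0.5, 0.25]
x_next = [a + b for a, b in zip(x_t, s_t)]
y_t = [g1 - g0 for g1, g0 in zip(gradient(x_next), gradient(x_t))]

B0 = [[1.0, 0.0], [0.0, 1.0]]        # initial guess: identity matrix
B1 = bfgs_update(B0, s_t, y_t)
print(B1, mat_vec(B1, s_t), y_t)
```

After a single update the approximation matches the true Hessian only along the direction of the step; repeated updates along different directions fill in the rest.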


Many variants of this update are implemented in numerical codes. In R, these
routines are available through the function optim. The code presented below
provides an example of the use of optim based on the estimation of the Least
Absolute Deviation estimator.
R Code for Least Absolute Deviation Estimator


dta <- read.csv("EthanolGasoline.csv")

# Ordinary least squares for comparison
bols <- lm(dta[,6]~dta[,7] + dta[,8])
print(summary.lm(bols))

# Regressor matrix with a constant, and the mean absolute deviation objective
x <- as.matrix(cbind(matrix(1,nrow=nrow(dta),ncol=1),dta[,7:8]))
lad <- function(b) {
  bvec <- rbind(b[1],b[2],b[3])
  ll <- abs(dta[,6] - x%*%bvec)
  return(1/nrow(dta)*sum(ll)) }

# Starting values; optim() uses its default Nelder-Mead method here
b0 <- cbind(0.95,0.31,0.19)

blad <- optim(b0,lad)


Glossary 


actuarial value a fair market value for a risky payoff — typically the expected 
value. 


aleatoric involving games of chance. 
asymptotic the behavior of a function f(x) as x becomes very large (i.e.,


Bayes’ theorem the theorem defining the probability of a conditional event 
based on the joint distribution and the marginal distribution of the condi- 
tioning event. 


best linear unbiased estimator the linear unbiased estimator (i.e., an es- 
timator that is a linear function of sample observations) that produces the 
smallest variance for the estimated parameter. 


binomial probability the probability function of repeated Bernoulli draws. 


Borel set an element of a σ-algebra.


composite event an event that is not a simple event. 


conditional density the probability density function (relative probability)
for one variable, conditioned on the outcome of another random variable being
known.


conditional mean the mean of a random variable or a function of random 
variables given that you know the value of another random variable. 


continuous random variable a random variable such that the probability 
of any one outcome approaches zero. 


convergence for our purposes, the tendency of sample statistics to approach 
population statistics or the tendency of unknown distributions to be arbi- 
trarily close to known distributions. 


convergence in distribution when the distribution function $F_n$ of a sequence
of random variables $\{X_n\}$ becomes arbitrarily close to the distribution
function of the random variable $X$.


convergence in mean square a sequence of random variables $\{X_n\}$ converges
in mean square if $\lim_{n \to \infty} E\left[X_n - X\right]^2 = 0$.


convergence in probability a sequence of random variables $\{X_n\}$ converges
in probability to a random variable $X$ if, for any $\epsilon > 0$,
$\lim_{n \to \infty} P\left(|X_n - X| < \epsilon\right) = 1$.


correlation coefficient normalized covariance between two random vari- 
ables. 


discrete random variable a random variable that can result in a finite 
number of values such as a coin toss or the number of dots visible on the 
roll of a die. 


endogeneity the scenario where one of the regressors is correlated with the 
error of the regression. 


Euclidean Borel field a Borel field defined on real number space for multi- 
ple variables. 


event a subset of the sample space. 


expected utility theory the economic theory that suggests that economic 
agents choose the outcome that maximizes the expected utility of the out- 
comes. 


frequency approach probability as defined by the relative frequency or rel- 
ative count of outcomes. 


Generalized Instrumental Variables using instruments that are uncorrelated
with the residuals and imperfectly correlated with X to remove
endogeneity.


gradient vector a vector of scalar derivatives for a multivariate function. 


heteroscedasticity the scenario where the variances of the residuals differ
across observations and/or the residuals are correlated across observations.


homoscedastic the scenario where the errors are independently and identically
distributed ($\sigma^2 I_{T \times T}$).


joint probability the probability that two or more random variables occur 
at the same time. 


kurtosis the normalized fourth moment of the distribution. The kurtosis is 
3.0 for the standard normal. 


Lebesgue integral a broader class of integrals covering a broader group of 
functions than the Riemann sum. 


marginal distribution the function depicting the relative probability of one 
variable in a multivariate distribution regardless or independent of the value 
of the other random variables. 


nonparametric statistics that do not assume a specific distribution. 


ordinary least squares typically refers to the linear estimator that minimizes
the sum of squared residuals, $\hat{\beta} = (X'X)^{-1}(X'Y)$.


probability density function a function that gives the relative probability 
for a continuous random variable. 


quantile the value $x^*(k)$ such that $k$ percent of the observations in the
sample are less than $x^*(k)$; with $k$ expressed as a proportion,
$\int_{-\infty}^{x^*(k)} f(z)\,dz = k$.


quartiles the values in the sample such that 25%, 50%, and 75% of the
observations are smaller.


Riemann sum the value of the integral of a function, typically the
antiderivative of the function.


Saint Petersburg paradox basically the concept of a gamble with an in- 
finite value that economic agents are only willing to pay a finite amount 
for. 


sample of convenience a sample that was not designed by the researcher 
— typically a sample that was created for a different reason. 


sample space the set of all possible outcomes. 


sigma-algebra a collection $\mathcal{F}$ of subsets of a nonempty set $\Omega$
satisfying closure under complementation and infinite union.


simple event an event which cannot be expressed as a union of other events. 


skewness the third central moment of the distribution. Symmetric distributions
such as the normal have zero skewness. Thus, a simple description of
skewness is that it is a measure of asymmetry.


statistical independence where the value of one random variable does not 
affect the probability of the value of another random variable. 


Two Stage Least Squares a two stage procedure to eliminate the effect 
of endogeneity. In the first stage the endogenous regressor is regressed on 
the truly exogenous variables. Then the estimated values of the endogenous 
regressors are used in a standard regression equation. 


Type I error the possibility of rejecting the null hypothesis when it is cor- 
rect. 


Type II error the possibility of failing to reject the null hypothesis when it 
is incorrect. 


Uniform distribution the uniform distribution has a constant probability
for each continuous outcome of a random variable over a range x ∈ [a,b].
The standard uniform distribution is U[0,1], such that f(x) = 1 for
x ∈ [0,1].


Working’s law of demand the conjecture that the share of the consumer’s 
expenditure on food declines as the natural logarithm of income increases. 


Statistics


Mathematical Statistics for Applied Econometrics covers the ba- 
sics of statistical inference in support of a subsequent course on 
classical econometrics. The book shows how mathematical statistics 
concepts form the basis of econometric formulations. It also helps 
you think about statistics as more than a toolbox of techniques. 


The text explores the unifying themes involved in quantifying sample 
information to make inferences. After developing the necessary prob- 
ability theory, it presents the concepts of estimation, such as conver- 
gence, point estimators, confidence intervals, and hypothesis tests. 
The text then shifts from a general development of mathematical sta- 
tistics to focus on applications particularly popular in economics. It 
delves into matrix analysis, linear models, and nonlinear econometric 
techniques. 


Features 

• Shows how mathematical statistics is useful in the analysis of 
economic decisions under risk and uncertainty 

• Describes statistical tools for inference, explaining the “why” 
behind statistical estimators, tests, and results 

• Provides an introduction to the symbolic computer programs 
Maxima and Mathematica®, which can be used to reduce the 
mathematical and numerical complexity of some formulations 

• Gives the R code for several applications 

• Includes summaries, review questions, and numerical exercises 
at the end of each chapter 


Avoiding a cookbook approach to econometrics, this book develops 
your theoretical understanding of statistical tools and econometric 
applications. It provides you with the foundation for further econo- 
metric studies. 